Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite for v2 #68

Merged
merged 80 commits into from
Aug 15, 2022
Merged

Rewrite for v2 #68

merged 80 commits into from
Aug 15, 2022

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Feb 20, 2022

Why a breaking change?

Essentially, #63 is unsolvable without making a breaking change.

I figured, if we were to break the API anyway, there were several areas where FASTX could be made nicer.

Important changes

External

  • All relevant functions and methods are now exported from FASTX
  • All Record objects are now always filled, and valid records, and the isfilled method has been removed. This avoids many annoying edge cases.
  • All BioGenerics method have been removed, except the ones used for the readers/writers.
  • Definition of "description" has changed, and now refers to the whole header line except the initial '>' or @
  • Record objects always have identifiers, descriptions, sequences and (for FASTQ), qualities. These may be empty.
  • Identifiers, descriptions, sequences and qualities are returned as lazy StringViews. Sequences can still be returned as a user-specified type, if so specified. It defaults to StringView if not.
  • You can no longer construct a Record from a string. Instead, use parse(Record, str).
  • New function: quality_scores returns the qualities as a lazy, validating iterator of scores using a default QualityEncoding object to decode ASCII PHRED scores to quality scores
  • Getting the quality of a FASTQ Record now requires a QualityEncoding object, although a default one is used if not passed explicitly.
  • FASTA.Reader is now more liberal in what it accepts and will parse nearly anything.
  • Readers have a new keyword, copy, which defaults to true. If false, the reader will overwrite the same record on iteration. This makes the old while !eof(reader) idiom obsolete in favor of iterating over Reader(io; copy=false).
  • All reader and writers now use keyword arguments for all arguments except the IO, for clarity and consistency
  • Using the latest changes in BioGenerics, readers can now use the following syntax to make reading gzipped files easier:
Reader(GzipDecompressorStream(open(path)); kwargs...) do reader
    # stuff
end
  • FASTQ.Read has been removed.
  • QualityEncoding is now an exported, documented object.
  • FASTQ.Writers will no longer by default modify FASTQ.Records's second header. An optional keyword forces the reader to always write/skip second header if set to true or false, but it defaults to nothing.
  • FASTQ.Writers now can no longer fill in ambiguous bases in Records transparently, or otherwise transform records, when writing. If the user wishes to transform records, they must do it my manually calling a function that transforms the records.
  • transcribe has been removed, as it is now trivial to do the same thing.
  • FAI indices can now be written as well as read.
  • FAI indices can now be generated from FASTA files using the faidx function.
  • extract can now extract parts of sequences from indexed FASTA files without loading a whole record. E.g. if you have a whole chromosome, you can load just a few basepairs without loading the entire chromosome (see so slowly of extract sequence by coord #29)
  • New functions validate_fasta and validate_fastq to quickly and memory-efficiently check if a file is FASTX-formatted.

Internal

  • Complete refactor of tests, which is now more thorough and easier to change in the future (Refactor tests #37)
  • Better comments, I believe
  • Record objects are now indexed in the FSM itself, instead of in a second pass.
  • Record objects are now more lightweight, i.e. FASTQ.Record 88 -> 32 bytes, and stores less implicit data in the data vector.
  • Readers and records are now more optimized in general
  • FAI parser is now Automa-based and faster.

closes #77
closes #73
closes #37
closes #63
closes #29

@codecov
Copy link

codecov bot commented Feb 20, 2022

Codecov Report

Merging #68 (b828fc6) into master (f03bccf) will increase coverage by 5.89%.
The diff coverage is 91.24%.

❗ Current head b828fc6 differs from pull request most recent head c66d035. Consider uploading reports for the commit c66d035 to get more accurate results

@@            Coverage Diff             @@
##           master      #68      +/-   ##
==========================================
+ Coverage   84.39%   90.28%   +5.89%     
==========================================
  Files          12       11       -1     
  Lines         660      628      -32     
==========================================
+ Hits          557      567      +10     
+ Misses        103       61      -42     
Flag Coverage Δ
unittests 90.28% <91.24%> (+5.89%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/fasta/record.jl 82.14% <82.69%> (+0.89%) ⬆️
src/fasta/index.jl 85.31% <85.10%> (-14.69%) ⬇️
src/FASTX.jl 91.17% <91.04%> (-8.83%) ⬇️
src/fastq/reader.jl 75.86% <91.66%> (-13.50%) ⬇️
src/fasta/reader.jl 87.50% <92.20%> (-2.36%) ⬇️
src/fastq/record.jl 94.94% <94.68%> (+10.89%) ⬆️
src/fastq/quality.jl 95.00% <95.00%> (+9.81%) ⬆️
src/fasta/readrecord.jl 100.00% <100.00%> (+3.57%) ⬆️
src/fasta/writer.jl 100.00% <100.00%> (+3.70%) ⬆️
src/fastq/readrecord.jl 100.00% <100.00%> (+53.84%) ⬆️
... and 8 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@jakobnissen
Copy link
Member Author

After a little thinking, I might

  • Make both non-empty identifiers and non-empty sequences mandatory for both FASTA and FASTQ
  • Encode the description as Union{Nothing, UnitRange{Int}} to be explicit about when it's empty.

@kescobo
Copy link
Member

kescobo commented Feb 20, 2022

I think empty identifiers are technically valid, eg

>
ATTGC
>
CCGAC

But I don't know if I would think of this as id == "" or ismissing(id).

A FASTA record without a sequence doesn't make any sense to me.

Copy link
Member

@CiaranOMara CiaranOMara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is neater, and I'm liking the StringViews.

I'm not sure about enforcing the presence of identifiers or sequences. But once a decision is made, it makes sense to remove checks for fields always present.

src/fasta/record.jl Outdated Show resolved Hide resolved
src/fasta/record.jl Show resolved Hide resolved
src/fasta/record.jl Outdated Show resolved Hide resolved
src/fastq/record.jl Outdated Show resolved Hide resolved
src/fasta/record.jl Outdated Show resolved Hide resolved
src/fastq/record.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
src/fastq/record.jl Outdated Show resolved Hide resolved
src/fastq/record.jl Outdated Show resolved Hide resolved
@CiaranOMara
Copy link
Member

But I don't know if I would think of this as id == "" or ismissing(id).

Either way, the end-user would still need to check and decide how to handle the information. I don't think reinterpreting the StringView does any favours.

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Mar 29, 2022

So, reading some webpages e.g.

here
here
here

I think we have an answer to this question regarding using missing or empty string returned by description(rec).

These pages seem to describe the entire '>' line as a [description|definition] line i.e. identifier + any additional info.

So, what if, we take that stance. We include identifier, as a convenience, and then, we have description() defined as giving you the entire description line - including the identifier.

So, for any record with just an ID and no additional info, will give you identifier(rec) == description(rec).

Any record with additional content after the identifier, then identifier(rec) != description(rec).

Then, because so many platforms have their own way of dealing with extra info - e.g. ncbi has the whole "[tag=value]" thing. We simply take the position "use description() to get the whole description line, parse it how you will, ya on ya own buddy."

Thus identifier becomes a subset of the description, and the behaviour of the two, is consistent.

@TransGirlCodes
Copy link
Member

@jakobnissen How do you feel about this proposal of doing away with description and just providing identifier and definition (essentially renamed header)?

@jakobnissen
Copy link
Member Author

That's a good idea. I like it. I'll implement the changes this week

@jakobnissen jakobnissen changed the title Deal better with missingness Rewrite for v2 Jul 28, 2022
@CiaranOMara CiaranOMara marked this pull request as draft August 4, 2022 23:10
@jakobnissen jakobnissen marked this pull request as ready for review August 5, 2022 16:23
@jakobnissen
Copy link
Member Author

jakobnissen commented Aug 5, 2022

@SabrinaJaye @kescobo and other interested parties:

This is now ready for review/test. There is too much code to review, but you can play around with it and see if you like how it feels, and if you approve of the changes described in the OP here. I recommend reading the new, updated documentation.

Now what is needed is just nice-to-haves, which can always be added later.

The only thing left to do here before tagging v2 is just to code coverage (I will take care of that), and if @SabrinaJaye have any ideas for high-level operations.

During the next week, I will finish up the last remaining tests, then in 1-2 weeks, I will squash merge this to master unless you have any comments, and then release FASTX v2.

@kescobo
Copy link
Member

kescobo commented Aug 5, 2022

Why does Documenter think you want to deploy via Travis.ci?
image

@kescobo
Copy link
Member

kescobo commented Aug 5, 2022

I think if you add push_preview=true to deploydocs() here, it should build a preview so we can view it online. See here.

@jakobnissen
Copy link
Member Author

jakobnissen commented Aug 9, 2022

@kescobo I tried to add previews, but apparently it's failing? :/ I can't figure out why. I added a new documenter key, but the build job claims it's not there or it's empty. Maybe it's acceptable that it doesn't work for PRs, I can look at it after pushing this to master.

@jakobnissen jakobnissen changed the base branch from release-2 to master August 9, 2022 09:37
* Bump BioSequences/BioSymbols to v3/v5

* Bump Julia version
Currently, `header`, `identifier` and `description` returns `String`, which forces needless allocations. This PR adds the dependency `StringViews`, which allows the creation of an `AbstractString` from any `AbstractVector{UInt8}`.

The aforementioned functions now return these string views backed by a view into the data buffer.
Rename FASTQ.FASTQRead to FASTQ.Read
@kescobo
Copy link
Member

kescobo commented Aug 9, 2022

I can look at it after pushing this to master.

Seems fine, I can try too. I'll build docs locally for now

@jakobnissen jakobnissen marked this pull request as draft August 10, 2022 18:22
@jakobnissen jakobnissen marked this pull request as ready for review August 13, 2022 10:19
@kescobo kescobo mentioned this pull request Aug 14, 2022
@jakobnissen jakobnissen merged commit f44c498 into BioJulia:master Aug 15, 2022
@jakobnissen jakobnissen deleted the missingness branch August 15, 2022 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants