Rewrite for v2 #68

jakobnissen · 2022-02-20T14:44:49Z

Why a breaking change?

Essentially, #63 is unsolvable without making a breaking change.

I figured, if we were to break the API anyway, there were several areas where FASTX could be made nicer.

Important changes

External

All relevant functions and methods are now exported from FASTX
All Record objects are now always filled and valid, and the isfilled method has been removed. This avoids many annoying edge cases.
All BioGenerics methods have been removed, except the ones used for the readers/writers.
The definition of "description" has changed: it now refers to the whole header line except the initial '>' or '@'.
Record objects always have identifiers, descriptions, sequences and (for FASTQ), qualities. These may be empty.
Identifiers, descriptions, sequences and qualities are returned as lazy StringViews. Sequences can still be returned as a user-specified type; if no type is given, the default is StringView.
You can no longer construct a Record from a string. Instead, use parse(Record, str).
New function: quality_scores returns the qualities as a lazy, validating iterator of scores using a default QualityEncoding object to decode ASCII PHRED scores to quality scores
Getting the quality of a FASTQ Record now requires a QualityEncoding object, although a default one is used if not passed explicitly.
FASTA.Reader is now more liberal in what it accepts and will parse nearly anything.
Readers have a new keyword, copy, which defaults to true. If false, the reader will overwrite the same record on iteration. This makes the old while !eof(reader) idiom obsolete in favor of iterating over Reader(io; copy=false).
All readers and writers now use keyword arguments for all arguments except the IO, for clarity and consistency.
Using the latest changes in BioGenerics, readers can now use the following syntax to make reading gzipped files easier:

Reader(GzipDecompressorStream(open(path)); kwargs...) do reader
    # stuff
end

FASTQ.Read has been removed.
QualityEncoding is now an exported, documented object.
FASTQ.Writers will no longer modify a FASTQ.Record's second header by default. An optional keyword forces the writer to always write or always skip the second header if set to true or false, respectively, but it defaults to nothing.
FASTQ.Writers can no longer fill in ambiguous bases in Records transparently, or otherwise transform records, when writing. If the user wishes to transform records, they must do it by manually calling a function that transforms the records.
transcribe has been removed, as it is now trivial to do the same thing.
FAI indices can now be written as well as read.
FAI indices can now be generated from FASTA files using the faidx function.
extract can now extract parts of sequences from indexed FASTA files without loading a whole record. E.g. if you have a whole chromosome, you can load just a few basepairs without loading the entire chromosome (see issue #29, "so slowly of extract sequence by coord").
New functions validate_fasta and validate_fastq to quickly and memory-efficiently check if a file is FASTX-formatted.
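
To make the list above more concrete, here is a minimal sketch of how the new external API might be used. It assumes the function names and the copy keyword work as described in this PR and as released in FASTX v2; the input strings and the variable names are made up for illustration.

```julia
using FASTX

# Records are built with `parse`, not with a string constructor.
record = parse(FASTA.Record, ">seq1 first test sequence\nACGT\nTTAG")

id   = identifier(record)        # header text before the first whitespace
desc = description(record)       # whole header line, minus the leading '>'
seq  = sequence(record)          # lazy StringView of the sequence by default
str  = sequence(String, record)  # or ask for a concrete type

# FASTQ qualities are decoded through a QualityEncoding; this sketch relies
# on the default encoding and does not pass one explicitly.
fq = parse(FASTQ.Record, "@read1\nACGT\n+\nII5!")
scores = collect(quality_scores(fq))

# With copy=false the reader reuses a single record between iterations,
# so copy out what is needed (here via String) instead of storing `rec`.
reader = FASTA.Reader(IOBuffer(">a\nAC\n>b x y\nGGT\n"); copy=false)
seen = Tuple{String,Int}[]
for rec in reader
    push!(seen, (String(identifier(rec)), length(sequence(rec))))
end
close(reader)
```

Storing plain strings and lengths rather than the records themselves is the point of the loop: with copy=false, any record kept from an earlier iteration would be overwritten by the next one.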

Internal

Complete refactor of the tests, which are now more thorough and easier to change in the future (Refactor tests #37)
Better comments, I believe
Record objects are now indexed in the FSM itself, instead of in a second pass.
Record objects are now more lightweight, e.g. FASTQ.Record went from 88 to 32 bytes, and store less implicit data in the data vector.
Readers and records are now more optimized in general
FAI parser is now Automa-based and faster.

closes #77
closes #73
closes #37
closes #63
closes #29

codecov · 2022-02-20T14:46:27Z

Codecov Report

Merging #68 (b828fc6) into master (f03bccf) will increase coverage by 5.89%.
The diff coverage is 91.24%.

❗ Current head b828fc6 differs from pull request most recent head c66d035. Consider uploading reports for the commit c66d035 to get more accurate results

@@            Coverage Diff             @@
##           master      #68      +/-   ##
==========================================
+ Coverage   84.39%   90.28%   +5.89%     
==========================================
  Files          12       11       -1     
  Lines         660      628      -32     
==========================================
+ Hits          557      567      +10     
+ Misses        103       61      -42

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `90.28% <91.24%> (+5.89%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/fasta/record.jl | `82.14% <82.69%> (+0.89%)` | ⬆️ |
| src/fasta/index.jl | `85.31% <85.10%> (-14.69%)` | ⬇️ |
| src/FASTX.jl | `91.17% <91.04%> (-8.83%)` | ⬇️ |
| src/fastq/reader.jl | `75.86% <91.66%> (-13.50%)` | ⬇️ |
| src/fasta/reader.jl | `87.50% <92.20%> (-2.36%)` | ⬇️ |
| src/fastq/record.jl | `94.94% <94.68%> (+10.89%)` | ⬆️ |
| src/fastq/quality.jl | `95.00% <95.00%> (+9.81%)` | ⬆️ |
| src/fasta/readrecord.jl | `100.00% <100.00%> (+3.57%)` | ⬆️ |
| src/fasta/writer.jl | `100.00% <100.00%> (+3.70%)` | ⬆️ |
| src/fastq/readrecord.jl | `100.00% <100.00%> (+53.84%)` | ⬆️ |

... and 8 more

jakobnissen · 2022-02-20T15:06:27Z

After a little thinking, I might:

Make both non-empty identifiers and non-empty sequences mandatory for both FASTA and FASTQ
Encode the description as Union{Nothing, UnitRange{Int}} to be explicit about when it's empty.
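
As a rough illustration of the second idea, the sketch below stores the header bytes once and keeps ranges into them, with nothing marking an absent description. ToyRecord and its two accessors are hypothetical names used only here; this is not FASTX's actual record layout.

```julia
# Hypothetical toy type, for illustration only (not the FASTX record layout).
struct ToyRecord
    data::Vector{UInt8}                          # raw header bytes
    identifier::UnitRange{Int}                   # always present
    description::Union{Nothing, UnitRange{Int}}  # `nothing` means "no description"
end

identifier(r::ToyRecord) = String(r.data[r.identifier])

function description(r::ToyRecord)
    # Callers must handle `nothing` explicitly instead of comparing to "".
    r.description === nothing && return nothing
    return String(r.data[r.description])
end

with_desc = ToyRecord(Vector{UInt8}("seq1 plasmid"), 1:4, 6:12)
without   = ToyRecord(Vector{UInt8}("seq1"), 1:4, nothing)
```

The trade-off is that every caller of description has to branch on nothing, which is the kind of extra check the later comments in this thread argue about.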

kescobo · 2022-02-20T18:16:02Z

I think empty identifiers are technically valid, e.g.

>
ATTGC
>
CCGAC

But I don't know if I would think of this as id == "" or ismissing(id).

A FASTA record without a sequence doesn't make any sense to me.

CiaranOMara

Overall I think this is neater, and I'm liking the StringViews.

I'm not sure about enforcing the presence of identifiers or sequences. But once a decision is made, it makes sense to remove checks for fields always present.

CiaranOMara · 2022-02-20T22:56:15Z

But I don't know if I would think of this as id == "" or ismissing(id).

Either way, the end-user would still need to check and decide how to handle the information. I don't think reinterpreting the StringView does any favours.

TransGirlCodes · 2022-03-29T20:49:50Z

So, reading some webpages (e.g. here, here, and here),

I think we have an answer to this question regarding using missing or empty string returned by description(rec).

These pages seem to describe the entire '>' line as a [description|definition] line i.e. identifier + any additional info.

So, what if, we take that stance. We include identifier, as a convenience, and then, we have description() defined as giving you the entire description line - including the identifier.

So any record with just an ID and no additional info will give you identifier(rec) == description(rec).

For any record with additional content after the identifier, identifier(rec) != description(rec).

Then, because so many platforms have their own way of dealing with extra info (e.g. NCBI has the whole "[tag=value]" thing), we simply take the position "use description() to get the whole description line, parse it how you will, ya on ya own buddy."

Thus identifier becomes a subset of the description, and the behaviour of the two, is consistent.
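
The proposal can be sketched in plain Julia, without FASTX. The helper split_header below is hypothetical and exists only to show the intended relationship: the description is the whole header line minus '>', and the identifier is its leading non-whitespace part.

```julia
# Hypothetical helper, not part of FASTX: split a FASTA header line into
# identifier and description under the convention proposed above.
function split_header(line::AbstractString)
    startswith(line, '>') || throw(ArgumentError("header must start with '>'"))
    desc = SubString(line, 2)        # whole line minus the leading '>'
    stop = findfirst(isspace, desc)  # identifier ends at the first whitespace
    id = stop === nothing ? desc : SubString(desc, 1, prevind(desc, stop))
    return (identifier = id, description = desc)
end

bare  = split_header(">seq1")
full  = split_header(">seq1 [organism=E. coli] plasmid")
empty = split_header(">")
```

Here bare has identical identifier and description, full has an identifier that is a strict prefix of its description, and empty yields two empty strings rather than a missing value.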

TransGirlCodes · 2022-04-03T13:08:00Z

@jakobnissen How do you feel about this proposal of doing away with description and just providing identifier and definition (essentially renamed header)?

jakobnissen · 2022-04-04T09:14:04Z

That's a good idea. I like it. I'll implement the changes this week

jakobnissen · 2022-08-05T16:27:19Z

@SabrinaJaye @kescobo and other interested parties:

This is now ready for review/test. There is too much code to review, but you can play around with it and see if you like how it feels, and if you approve of the changes described in the OP here. I recommend reading the new, updated documentation.

Now what is needed is just nice-to-haves, which can always be added later.

The only things left to do here before tagging v2 are to improve code coverage (I will take care of that), and to hear whether @SabrinaJaye has any ideas for high-level operations.

During the next week, I will finish up the last remaining tests, then in 1-2 weeks, I will squash merge this to master unless you have any comments, and then release FASTX v2.

kescobo · 2022-08-05T20:17:22Z

Why does Documenter think you want to deploy via Travis.ci?

kescobo · 2022-08-05T20:20:41Z

I think if you add push_preview=true to deploydocs() here, it should build a preview so we can view it online. See here.

jakobnissen · 2022-08-09T07:07:01Z

@kescobo I tried to add previews, but apparently it's failing? :/ I can't figure out why. I added a new documenter key, but the build job claims it's not there or it's empty. Maybe it's acceptable that it doesn't work for PRs, I can look at it after pushing this to master.

Currently, `header`, `identifier` and `description` return `String`, which forces needless allocations. This PR adds the dependency `StringViews`, which allows the creation of an `AbstractString` from any `AbstractVector{UInt8}`. The aforementioned functions now return these string views backed by a view into the data buffer.

kescobo · 2022-08-09T14:49:44Z

I can look at it after pushing this to master.

Seems fine, I can try too. I'll build docs locally for now

jakobnissen mentioned this pull request Feb 20, 2022

Change fileld field of records to Int #69

Closed

CiaranOMara reviewed Feb 20, 2022

jakobnissen force-pushed the missingness branch from 6ec73c5 to 17f3dfc Compare July 26, 2022 07:38

jakobnissen changed the title ~~Deal better with missingness~~ Rewrite for v2 Jul 28, 2022

CiaranOMara marked this pull request as draft August 4, 2022 23:10

jakobnissen marked this pull request as ready for review August 5, 2022 16:23

jakobnissen changed the base branch from release-2 to master August 9, 2022 09:37

jakobnissen added 12 commits August 9, 2022 11:41

Remove autodetection of FASTA sequence type

17350e4

Disallow record construction from anything (BioJulia#60)

0204c8e

Update Julia requirement to 1.6 (BioJulia#64)

1121700

Export all user-facing functions

129a7c3

Bump BioSequences/BioSymbols to v3/v5 (BioJulia#66)

3fd61ff

* Bump BioSequences/BioSymbols to v3/v5 * Bump Julia version

Update docstrings of functions returning string views

fb35d5f

Fix tests

a85aaa2

Make identifier and header always return a string

0a74051

Rename FASTQ.FASTQRead to FASTQ.Read

Make description return Union{Nothing, StringView}

cfa37a8

Make sequence mandatory in FASTA

87906a7

Make sequence mandatory in FASTQ

1428cb1

jakobnissen added 9 commits August 9, 2022 11:43

Use Random as test dep

c2ec9e8

Update CHANGELOG for v2

0bedad9

Add DOCUMENTER_KEY

7f87995

Unexport seqlen

dd0971f

Touchup docs

7f0a64f

Fix more doctests

be6ea71

Make out-of-order Index files work

6e13aa2

More doc fixes

f30555b

Fix typo in CHANGELOG

54765bd

jakobnissen force-pushed the missingness branch from 04431b4 to 54765bd Compare August 9, 2022 09:44

jakobnissen mentioned this pull request Aug 9, 2022

[RFC]: Broaden characters allowed in FASTA sequences #75

Closed

kescobo mentioned this pull request Aug 9, 2022

Make Documenter preview builds work. #81

Closed

jakobnissen added 2 commits August 10, 2022 11:15

Improve parsing logic

020e384

Use one implementation of memcmp

e29fdc7

jakobnissen force-pushed the missingness branch from 4befa1f to e29fdc7 Compare August 10, 2022 09:15

jakobnissen added 3 commits August 10, 2022 12:37

GC preserve in validate_fastq

5c89b58

Remove Record!, add copy!(::Record, ::Record)

c3b1f3c

Remove use of BioGenerics.Automa.State

fb8ba3e

jakobnissen marked this pull request as draft August 10, 2022 18:22

jakobnissen added 5 commits August 10, 2022 20:32

Use finalizers for Writers

22fa94a

Make finalizers async

67f0fdb

Minor tweaks

145d444

Improve error message on parsing

b828fc6

Add index! function

c66d035

jakobnissen mentioned this pull request Aug 13, 2022

Use ReTest or similar niceness? #83

Closed

jakobnissen marked this pull request as ready for review August 13, 2022 10:19

kescobo mentioned this pull request Aug 14, 2022

Use ReTest #84

Merged

jakobnissen merged commit f44c498 into BioJulia:master Aug 15, 2022

jakobnissen deleted the missingness branch August 15, 2022 11:49
