Rewrite for v2 (#68)

Complete rewrite for v2. See the changelog for more details.
BioJulia · Aug 15, 2022 · f44c498 · f44c498
1 parent f03bccf
commit f44c498

Showing 37 changed files with 3,624 additions and 2,456 deletions.
diff --git a/.github/workflows/Documentation.yml b/.github/workflows/Documentation.yml
@@ -21,6 +21,5 @@ jobs:
         run: julia --color=yes --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
       - name: Build and deploy
         env:
-          # GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # For authentication with GitHub Actions token
           DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
-        run: julia --color=yes --project=docs/ docs/make.jl
+        run: julia --color=yes --project=docs/ docs/make.jl
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,77 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
+## [2.0.0]
+Version 2 is a near-complete rewrite of FASTX.
+It strives to provide an easier and more consistent API, while also being
+faster, more memory efficient, and better tested than v1.
+
+The changes are comprehensive, but code should only need a few minor tweaks to
+work with v2. I recommend upgrading your packages using a static analysis tool like JET.jl.
+
+### Breaking changes
+#### Records
+* `description` has changed meaning: In v1, it meant the part of the header after the '>' symbol
+  and up until first whitespace. Now it extends to the whole header line until the ending newline.
+  This implies the identifier is a prefix of the description.
+* `header` has been removed, and is now replaced by `description`.
+* All `Record` objects now have an identifier, a description and a sequence, and all `FASTQRecord`s
+  have a quality. These may be empty, but will not throw an error when accessing them.
+* As a consequence, all "checker" functions like `hassequence`, `isfilled`, `hasdescription` and
+  so on have been removed, since the answer is now trivially "yes" in all cases.
+* `identifier`, `description`, `sequence` and `quality` now return an `AbstractString` by default.
+  Although it is an implementation detail, it uses zero-copy string views for performance.
+* You can no longer construct a record using e.g. `Record(::String)`. Instead, use `parse(Record, ::String)`.
+
+#### Readers/writers
+* All readers/writers now take every argument other than the main IO as a keyword, for clarity
+  and consistency.
+* FASTQ.Writers will no longer by default modify `FASTQ.Records`'s second header.
+  An optional keyword forces the writer to always write/skip the second header if set to `true` or `false`,
+  but it defaults to `nothing`, meaning it leaves it intact.
+* FASTQ writers can no longer fill in ambiguous bases in Records transparently,
+  or otherwise transform records, when writing.
+  If the user wishes to transform records, they must do it by manually calling a function that transforms the records.
+
+#### Other breaking changes
+* `FASTQ.Read` has been removed. To subset a read, extract the sequence and quality, and construct
+  a new Record object from these.
+* `transcribe` has been removed, as it is now trivial to do the same thing.
+  It may be added in a future release with new functionality.
+
+### New features
+* Function `quality_scores` returns the qualities of a FASTQ record as a lazy, validating iterator
+  of PHRED quality scores.
+* New object: `QualityEncoding` can be used to construct custom PHRED/ASCII quality encodings.
+  Accessing quality scores uses an existing default object.
+* Readers now have a keyword `copy` that defaults to `true`. If set to `false`, iterating over
+  a reader will overwrite the same record for performance. Use with care.
+  This makes the old `while !eof(reader)`-idiom obsolete in favor of iterating over a reader
+  constructed with `copy=false`.
+* Users can now use the following syntax to make processing gzipped readers easier:
+  ```
+  Reader(GzipDecompressorStream(open(path)); kwargs...) do reader
+      # stuff
+  end
+  ```
+  This is a change in BioGenerics.jl, but is guaranteed to work in FASTX.jl v2.
+* FAI (FASTA index) files can now be written as well as read.
+* FASTA files can now be indexed with the new function `faidx`.
+* Function `extract` can extract parts of a sequence from an indexed FASTA reader
+  without loading the entire sequence into memory.
+  You can use this to e.g. extract a small part of a large chromosome. (see #29)
+* New functions `validate_fasta` and `validate_fastq` validate whether an `IO` is validly
+  formatted, faster and more memory-efficiently than loading in the file.
+
+### Other changes
+* All practically useful functions and types are now exported directly from FASTX,
+  so users don't need to prepend identifiers with `FASTA.` or `FASTQ.`.
+* FASTA readers are more liberal in what formats they will accept (#73)  
+
+### Removed
+* The method `FASTA.sequence(::FASTA.Record)` has been removed, since the auto-detection of sequence
+  type could not be made reliable enough.
+
 ## [1.2.0] - 2021-07-13
 ### Added:
 * `header(::Union{FASTA.Record, FASTQ.Record})` returns the full header line.
@@ -18,9 +89,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 * Various small fixes to the internal parsing regex
 * Writers are now parametric and buffered for increased writing speed
 * Fixed a bug where Windows-style newlines would break the parser
-
-## Unreleased
-
+
 ## [1.1.0] - 2019-08-07
 ### Added
 - `Base.copyto!` methods for copying record data to LongSequences.
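
The record changes listed under 2.0.0 can be illustrated with a short sketch. It assumes FASTX v2 is installed; the header text and sequences are invented for the example and do not come from the commit.

```julia
using FASTX

# v1-style `Record(::String)` is gone; v2 parses text explicitly.
record = parse(FASTARecord, ">seq1 human chromosome fragment\nTAGAAAGCAA")

# The description is the whole header line (without '>'),
# and the identifier is its prefix up to the first whitespace.
id = identifier(record)
desc = description(record)
println(id)
println(desc)

# Accessors return string views by default; ask for a `String` to get a copy.
seq = sequence(String, record)
println(seq)

# A record can also be built from a description and a sequence
# (hypothetical content, for illustration only).
handmade = FASTARecord("seq2 built by hand", "ACGT")
println(identifier(handmade))
```

Because the default return values are views into the record's buffer, requesting a `String` is the safer choice when the value must outlive the record.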

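The `copy=false` reader keyword and `quality_scores` from the 2.0.0 notes could be combined roughly as follows. This is a sketch under the assumption that FASTX v2 is installed and that the default quality encoding is used; the two reads are made up for the example.

```julia
using FASTX

# Two invented FASTQ reads, kept in memory for the example.
data = "@read1 first\nACGT\n+\nIIII\n@read2 second\nGGCC\n+\nIIII\n"

# With `copy=false` the reader reuses one record object, so copy out what
# you need inside the loop instead of storing the records themselves.
reader = FASTQReader(IOBuffer(data); copy=false)
ids = String[]
mean_scores = Float64[]
for record in reader
    push!(ids, String(identifier(record)))    # copy out of the reused record
    scores = collect(quality_scores(record))  # PHRED scores, default encoding
    push!(mean_scores, sum(scores) / length(scores))
end
close(reader)

println(ids)
println(mean_scores)
```

Storing `record` itself in a collection inside this loop would leave every element pointing at the same, overwritten object, which is why the changelog says to use the keyword with care.
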
diff --git a/Project.toml b/Project.toml
@@ -1,29 +1,27 @@
 name = "FASTX"
 uuid = "c2308a5c-f048-11e8-3e8a-31650f418d12"
-authors = [
-    "Sabrina J. Ward <sabrinajward@protonmail.com>",
-    "Jakob N. Nissen <jakobnybonissen@gmail.com>"
-]
-version = "1.3.0"
+authors = ["Sabrina J. Ward <sabrinajward@protonmail.com>", "Jakob N. Nissen <jakobnybonissen@gmail.com>"]
+version = "2.0.0"
 
 [deps]
 Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
 BioGenerics = "47718e42-2ac5-11e9-14af-e5595289c2ea"
 BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
-BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
+ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb"
+StringViews = "354b36f9-a18e-4713-926e-db85100087ba"
 TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"
 
 [compat]
-Automa = "0.7, 0.8"
-BioGenerics = "0.1"
+Automa = "0.8"
+BioGenerics = "0.1.2"
 BioSequences = "3"
-BioSymbols = "5"
+ScanByte = "0.3"
+StringViews = "1"
 TranscodingStreams = "0.9.5"
 julia = "1.6"
 
 [extras]
-FormatSpecimens = "3372ea36-2a1a-11e9-3eb7-996970b6ffbd"
-Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
+Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 
 [targets]
-test = ["Test", "FormatSpecimens"]
+test = ["Random"]
diff --git a/docs/Project.toml b/docs/Project.toml
@@ -3,4 +3,5 @@ BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 
 [compat]
-Documenter = "~0.22"
+BioSequences = "3"
+Documenter = "0.27" 
diff --git a/docs/make.jl b/docs/make.jl
@@ -1,28 +1,29 @@
 using Documenter, FASTX
 
-# Build documentation.
+DocMeta.setdocmeta!(FASTX, :DocTestSetup, :(using FASTX, BioSequences); recursive=true)
 
 makedocs(
+    modules = [FASTX],
     format = Documenter.HTML(),
-    modules = [FASTX, FASTX.FASTQ, FASTX.FASTA],
     sitename = "FASTX.jl",
-    doctest = false,
-    strict = false,
+    doctest = true,
     pages = [
-        "Home" => "index.md",
-        "Manual" => [
-            "FASTA formatted files" => "manual/fasta.md",
-            "FASTQ formatted files" => "manual/fastq.md"
+        "Overview" => Any[
+            "Overview" => "index.md",
+            "Records" => "records.md",
+            "File I/O" => "files.md",
         ],
-        "Library" => [
-            "Public" => "lib/public.md"
-        ]
+        "FASTA" => "fasta.md",
+        "FASTQ" => "fastq.md",
+        "FAI" => "fai.md"
     ],
-    authors = "Ben J. Ward, The BioJulia Organisation and other contributors."
+    authors = "Sabrina J. Ward, Jakob N. Nissen, The BioJulia Organisation and other contributors.",
+    checkdocs = :all
 )
 
 deploydocs(
     repo = "github.com/BioJulia/FASTX.jl.git",
+    push_preview = true,
     deps = nothing,
     make = nothing
-)
+)
diff --git a/docs/src/fai.md b/docs/src/fai.md
@@ -0,0 +1,148 @@
+```@meta
+CurrentModule = FASTX
+DocTestSetup = quote
+    using FASTX
+end
+```
+
+# FASTA index (FAI files)
+FASTX.jl supports FASTA index (FAI) files.
+When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily.
+
+See the FAI specification here: http://www.htslib.org/doc/faidx.html
+
+### Making an `Index`
+A FASTA index (of type `Index`) can be constructed from an `IO` object representing a FAI file:
+
+```jldoctest
+julia> io = IOBuffer("seqname\t9\t2\t6\t8");
+
+julia> Index(io) isa Index
+true
+```
+
+Or from a path representing a FAI file:
+```julia
+julia> Index("/path/to/file.fai")
+```
+
+Alternatively, a FASTA file can be indexed to produce an `Index` using `faidx`.
+
+```jldoctest
+julia> faidx(IOBuffer(">abc\nTAGA\nTA"))
+Index:
+  abc	6	5	4	5
+```
+
+A FASTA file can also be indexed, and the index immediately written to a FAI file,
+by passing an `AbstractString` to `faidx`:
+
+```julia
+julia> ispath("/path/to/fasta.fna.fai")
+false
+
+julia> faidx("/path/to/fasta.fna");
+
+julia> ispath("/path/to/fasta.fna.fai")
+true
+```
+
+Note that the restrictions on FASTA files for indexing are stricter than those of the FASTA parser itself,
+so not all FASTA files that can be read can be indexed:
+
+```jldoctest
+julia> str = ">\0\n\0";
+
+julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord
+true
+
+julia> Index(IOBuffer(str))
+ERROR
+[...]
+```
+
+### Attaching an `Index` to a `Reader`
+When opening a `FASTA.Reader`, you can attach an `Index` by passing the `index` keyword.
+You can either pass an `Index` directly, or else an `IO`, in which case an `Index` will be parsed from the `IO`,
+or an `AbstractString` that will be interpreted as a path to a FAI file:
+
+```jldoctest
+julia> str = ">abc\nTAG\nTA";
+
+julia> idx = faidx(IOBuffer(str));
+
+julia> rdr = FASTAReader(IOBuffer(str), index=idx);
+```
+
+You can also add an index to an existing reader using the `index!` function:
+
+```@docs
+index!
+```
+
+### Seeking using an `Index`
+With an `Index` attached to a `Reader`, you can do the following operations in O(1) time.
+In these examples, we will use the following FASTA file:
+
+```
+>seq1 sequence
+TAGAAAGCAA
+TTAAAC
+>seq2 sequence
+AACGG
+UUGC
+```
+
+```@meta
+DocTestSetup = quote
+using FASTX
+
+data = """>seq1 sequence
+TAGAAAGCAA
+TTAAAC
+>seq2 sequence
+AACGG
+UUGC
+"""
+
+reader = FASTA.Reader(IOBuffer(data), index=faidx(IOBuffer(data)))
+
+end
+```
+
+* Seek to a Record using its identifier:
+```jldoctest
+julia> seekrecord(reader, "seq2");
+
+julia> record = first(reader); sequence(record)
+"AACGGUUGC"
+```
+
+* Directly extract a record using its identifier:
+```jldoctest
+julia> record = reader["seq1"];
+
+julia> description(record)
+"seq1 sequence"
+```
+
+* Extract a sequence directly without loading the whole record into memory.
+  This is useful for huge sequences like chromosomes:
+```jldoctest
+julia> extract(reader, "seq1", 3:5)
+"GAA"
+```
+
+```@meta
+DocTestSetup = nothing
+```
+
+FASTX.jl does not yet support indexing FASTQ files.
+
+### Reference:
+```@docs
+faidx
+seekrecord
+extract
+Index
+```
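
The pieces of `fai.md` above fit together as in the following standalone sketch. It assumes FASTX v2 is installed and reuses the two-record FASTA text from the documentation; everything is kept in memory, so no FAI file is written.

```julia
using FASTX

data = """>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
"""

# Build the index from the FASTA text, then attach it to a reader.
index = faidx(IOBuffer(data))
reader = FASTAReader(IOBuffer(data); index=index)

# Random access to a whole record by identifier.
record = reader["seq2"]
whole = sequence(String, record)
println(whole)

# Pull out a subsequence without loading the whole record.
fragment = extract(reader, "seq1", 3:5)
println(fragment)

close(reader)
```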
diff --git a/docs/src/fasta.md b/docs/src/fasta.md
@@ -0,0 +1,59 @@
+```@meta
+CurrentModule = FASTX
+DocTestSetup = quote
+    using FASTX
+end
+```
+
+# FASTA formatted files
+__NB: First read the overview in the sidebar__
+
+FASTA is a text-based file format for representing biological sequences.
+A FASTA file stores a list of sequence records with name, description, and
+sequence.
+
+The template of a sequence record is:
+
+```
+>{description}
+{sequence}
+```
+
+The "identifier" is the first part of the description, up to the first whitespace
+(or the entire description if there is no whitespace).
+
+Here is an example of a chromosomal sequence:
+
+```
+>chrI chromosome 1
+CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
+CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
+```
+
+## The `FASTARecord`
+FASTA records are, by design, very lax in what they can contain.
+They can contain almost arbitrary byte sequences, including invalid Unicode, and trailing whitespace on their sequence lines, which will be interpreted as part of the sequence.
+If you want more certainty about the format, you can either check the content of the sequences with a regex or (preferably) convert them to the desired `BioSequence` type.
+
+```@docs
+FASTA.Record
+```
+
+### Reference:
+```@docs
+identifier
+description
+sequence
+```
+
+## `FASTAReader` and `FASTAWriter`
+`FASTAWriter` can optionally be passed the keyword `width` to control the line width.
+If this is zero or negative, it will write all record sequences on a single line.
+Otherwise, it will wrap lines to the given maximal width.
+
+### Reference:
+```@docs
+FASTA.Reader
+FASTA.Writer
+validate_fasta
+```
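
The `width` keyword described in `fasta.md` can be tried out with a small sketch like the one below. It assumes FASTX v2 is installed; the record and the width of 6 are arbitrary choices for the example, and a temporary file is used so that the writer can be closed normally before the output is read back.

```julia
using FASTX

# An invented 16-base record.
record = FASTARecord("seq1 demo", "TAGAAAGCAATTAAAC")

# Write to a temporary file, wrapping sequence lines at 6 characters.
path = tempname()
writer = FASTAWriter(open(path, "w"); width=6)
write(writer, record)
close(writer)  # flushes buffered output and closes the file

wrapped = read(path, String)
print(wrapped)
rm(path)
```

Closing the writer matters here: writers are buffered, so the file may be incomplete until the writer is flushed or closed.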