Rewrite for v2 (#68)
Complete rewrite for v2.
See the changelog for more details.
jakobnissen authored Aug 15, 2022
1 parent f03bccf commit f44c498
Showing 37 changed files with 3,624 additions and 2,456 deletions.
3 changes: 1 addition & 2 deletions .github/workflows/Documentation.yml
@@ -21,6 +21,5 @@ jobs:
run: julia --color=yes --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- name: Build and deploy
env:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # For authentication with GitHub Actions token
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
run: julia --color=yes --project=docs/ docs/make.jl
run: julia --color=yes --project=docs/ docs/make.jl
75 changes: 72 additions & 3 deletions CHANGELOG.md
@@ -4,6 +4,77 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [2.0.0]
Version 2 is a near-complete rewrite of FASTX.
It strives to provide an easier and more consistent API, while also being
faster, more memory efficient, and better tested than v1.

The changes are comprehensive, but code should only need a few minor tweaks to
work with v2. I recommend upgrading your packages using a static analysis tool like JET.jl.

### Breaking changes
#### Records
* `description` has changed meaning: In v1, it meant the part of the header after the '>' symbol
and up until first whitespace. Now it extends to the whole header line until the ending newline.
This implies the identifier is a prefix of the description.
* `header` has been removed, and is now replaced by `description`.
* All `Record` objects now have an identifier, a description and a sequence, and all `FASTQRecord`s
have a quality. These may be empty, but will not throw an error when accessing them.
* As a consequence, all "checker" functions like `hassequence`, `isfilled`, `hasdescription` and
so on have been removed, since the answer now is trivially "yes" in all cases.
* `identifier`, `description`, `sequence` and `quality` now return an `AbstractString` by default.
Although it is an implementation detail, it uses zero-copy string views for performance.
* You can no longer construct a record using e.g. `Record(::String)`. Instead, use `parse(Record, ::String)`.
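
As a rough illustration of the record changes listed above, the sketch below parses a record and reads its fields. It assumes FASTX v2 is installed, and the record content is made up for the example.

```julia
using FASTX

# Records are now built with `parse` rather than `Record(::String)`.
record = parse(FASTARecord, ">seq1 some description\nTAGA")

# The identifier is the header up to the first whitespace ...
id = identifier(record)       # compares equal to "seq1"
# ... and the description is now the whole header line.
desc = description(record)    # compares equal to "seq1 some description"
# The sequence is returned as a string, not as a BioSequence.
seq = sequence(record)        # compares equal to "TAGA"

# The identifier is a prefix of the description.
startswith(desc, id)          # true
```

Because the accessors return string views, convert with `String(id)` if the value needs to outlive the record.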

#### Readers/writers
* All readers/writers now take all arguments other than the main IO as keywords, for clarity
and consistency.
* By default, `FASTQ.Writer`s no longer modify the second header of a `FASTQ.Record`.
An optional keyword forces the writer to always write or always skip the second header if set to `true` or `false`, respectively,
but it defaults to `nothing`, meaning the header is left intact.
* FASTQ writers can no longer transparently fill in ambiguous bases in Records,
or otherwise transform records, when writing.
If the user wishes to transform records, they must do so by manually calling a function that transforms the records.

#### Other breaking changes
* `FASTQ.Read` has been removed. To subset a read, extract the sequence and quality, and construct
a new Record object from these.
* `transcribe` has been removed, as it is now trivial to do the same thing.
It may be added in a future release with new functionality.

### New features
* Function `quality_scores` returns the qualities of a FASTQ record as a lazy, validating iterator
of PHRED quality scores.
* New object: `QualityEncoding` can be used to construct custom PHRED/ASCII quality encodings.
By default, accessing quality scores uses an existing default object.
* Readers now have a keyword `copy` that defaults to `true`. If set to `false`, iterating over
a reader will overwrite the same record for performance. Use with care.
This makes the old `while !eof(reader)`-idiom obsolete in favor of iterating over a reader
constructed with `copy=false`.
* Users can now use the following syntax to make processing gzipped readers easier:
```
Reader(GzipDecompressorStream(open(path)); kwargs...) do reader
# stuff
end
```
This is a change in BioGenerics.jl, but is guaranteed to work in FASTX.jl v2.
* FAI (FASTA index) files can now be written as well as read.
* FASTA files can now be indexed with the new function `faidx`.
* Function `extract` can extract parts of a sequence from an indexed FASTA reader
without loading the entire sequence into memory.
You can use this to e.g. extract a small part of a large chromosome. (see #29)
* New functions `validate_fasta` and `validate_fastq` check whether an `IO` is validly
formatted, faster and more memory-efficiently than loading in the file.
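
A minimal sketch of some of the features listed above, assuming FASTX v2 and a small made-up FASTQ input. It iterates a reader constructed with `copy=false`, collects PHRED scores with `quality_scores`, and checks the input with `validate_fastq`. That `validate_fastq` returns `nothing` for valid input is my reading of its documentation, so treat it as an assumption.

```julia
using FASTX

data = "@read1 first\nTAGA\n+\nIIII\n@read2 second\nCCGT\n+\nIIII\n"

# With `copy=false` the reader may overwrite the same record on each iteration,
# so copy out what is needed inside the loop instead of storing the record.
ids = String[]
scores = Vector{Int}[]
reader = FASTQReader(IOBuffer(data); copy=false)
for record in reader
    push!(ids, String(identifier(record)))
    # `quality_scores` is lazy; `collect` materializes the scores.
    push!(scores, collect(Int, quality_scores(record)))
end
close(reader)

# Assumption: `nothing` signals that the input is validly formatted.
is_valid = validate_fastq(IOBuffer(data)) === nothing
```

With the default encoding, the quality character `I` decodes to a PHRED score of 40.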

### Other changes
* All practically useful functions and types are now exported directly from FASTX,
so users don't need to prepend identifiers with `FASTA.` or `FASTQ.`.
* FASTA readers are more liberal in what formats they will accept (#73)

### Removed
* The method `FASTA.sequence(::FASTA.Record)` has been removed, since the auto-detection of sequence
type could not be made reliable enough.

## [1.2.0] - 2021-07-13
### Added:
* `header(::Union{FASTA.Record, FASTQ.Record})` returns the full header line.
@@ -18,9 +89,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
* Various small fixes to the internal parsing regex
* Writers are now parametric and buffered for increased writing speed
* Fixed a bug where Windows-style newlines would break the parser

## Unreleased

## [1.1.0] - 2019-08-07
### Added
- `Base.copyto!` methods for copying record data to LongSequences.
22 changes: 10 additions & 12 deletions Project.toml
@@ -1,29 +1,27 @@
name = "FASTX"
uuid = "c2308a5c-f048-11e8-3e8a-31650f418d12"
authors = [
"Sabrina J. Ward <sabrinajward@protonmail.com>",
"Jakob N. Nissen <jakobnybonissen@gmail.com>"
]
version = "1.3.0"
authors = ["Sabrina J. Ward <sabrinajward@protonmail.com>", "Jakob N. Nissen <jakobnybonissen@gmail.com>"]
version = "2.0.0"

[deps]
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
BioGenerics = "47718e42-2ac5-11e9-14af-e5595289c2ea"
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb"
StringViews = "354b36f9-a18e-4713-926e-db85100087ba"
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[compat]
Automa = "0.7, 0.8"
BioGenerics = "0.1"
Automa = "0.8"
BioGenerics = "0.1.2"
BioSequences = "3"
BioSymbols = "5"
ScanByte = "0.3"
StringViews = "1"
TranscodingStreams = "0.9.5"
julia = "1.6"

[extras]
FormatSpecimens = "3372ea36-2a1a-11e9-3eb7-996970b6ffbd"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[targets]
test = ["Test", "FormatSpecimens"]
test = ["Random"]
3 changes: 2 additions & 1 deletion docs/Project.toml
@@ -3,4 +3,5 @@ BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

[compat]
Documenter = "~0.22"
BioSequences = "3"
Documenter = "0.27"
27 changes: 14 additions & 13 deletions docs/make.jl
@@ -1,28 +1,29 @@
using Documenter, FASTX

# Build documentation.
DocMeta.setdocmeta!(FASTX, :DocTestSetup, :(using FASTX, BioSequences); recursive=true)

makedocs(
modules = [FASTX],
format = Documenter.HTML(),
modules = [FASTX, FASTX.FASTQ, FASTX.FASTA],
sitename = "FASTX.jl",
doctest = false,
strict = false,
doctest = true,
pages = [
"Home" => "index.md",
"Manual" => [
"FASTA formatted files" => "manual/fasta.md",
"FASTQ formatted files" => "manual/fastq.md"
"Overview" => Any[
"Overview" => "index.md",
"Records" => "records.md",
"File I/O" => "files.md",
],
"Library" => [
"Public" => "lib/public.md"
]
"FASTA" => "fasta.md",
"FASTQ" => "fastq.md",
"FAI" => "fai.md"
],
authors = "Ben J. Ward, The BioJulia Organisation and other contributors."
authors = "Sabrina J. Ward, Jakob N. Nissen, The BioJulia Organisation and other contributors.",
checkdocs = :all
)

deploydocs(
repo = "github.com/BioJulia/FASTX.jl.git",
push_preview = true,
deps = nothing,
make = nothing
)
)
148 changes: 148 additions & 0 deletions docs/src/fai.md
@@ -0,0 +1,148 @@
```@meta
CurrentModule = FASTX
DocTestSetup = quote
using FASTX
end
```

# FASTA index (FAI files)
FASTX.jl supports FASTA index (FAI) files.
When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily.

See the FAI specification here: http://www.htslib.org/doc/faidx.html

### Making an `Index`
A FASTA index (of type `Index`) can be constructed from an `IO` object representing a FAI file:

```jldoctest
julia> io = IOBuffer("seqname\t9\t2\t6\t8");
julia> Index(io) isa Index
true
```

Or from a path representing a FAI file:
```julia
julia> Index("/path/to/file.fai")
```

Alternatively, a FASTA file can be indexed to produce an `Index` using `faidx`.

```jldoctest
julia> faidx(IOBuffer(">abc\nTAGA\nTA"))
Index:
abc 6 5 4 5
```

Alternatively, a FASTA file can be indexed, and the index immediately written to a FAI file,
by passing an `AbstractString` to `faidx`:

```julia
julia> ispath("/path/to/fasta.fna.fai")
false

julia> faidx("/path/to/fasta.fna");

julia> ispath("/path/to/fasta.fna.fai")
true
```

Note that the restrictions on FASTA files for indexing are stricter than Julia's FASTA parser,
so not all FASTA files that can be read can be indexed:

```jldoctest
julia> str = ">\0\n\0";
julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord
true
julia> Index(IOBuffer(str))
ERROR
[...]
```

### Attaching an `Index` to a `Reader`
When opening a `FASTA.Reader`, you can attach an `Index` by passing the `index` keyword.
You can either pass an `Index` directly, or else an `IO`, in which case an `Index` will be parsed from the `IO`,
or an `AbstractString` that will be interpreted as a path to a FAI file:

```jldoctest
julia> str = ">abc\nTAG\nTA";
julia> idx = faidx(IOBuffer(str));
julia> rdr = FASTAReader(IOBuffer(str), index=idx);
```

You can also add an index to an existing reader using the `index!` function:

```@docs
index!
```

### Seeking using an `Index`
With an `Index` attached to a `Reader`, you can do the following operations in O(1) time.
In these examples, we will use the following FASTA file:

```
>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
```

```@meta
DocTestSetup = quote
using FASTX
data = """>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
"""
reader = FASTA.Reader(IOBuffer(data), index=faidx(IOBuffer(data)))
end
```

* Seek to a Record using its identifier:
```jldoctest
julia> seekrecord(reader, "seq2");
julia> record = first(reader); sequence(record)
"AACGGUUGC"
```

* Directly extract a record using its identifier
```jldoctest
julia> record = reader["seq1"];
julia> description(record)
"seq1 sequence"
```

* Extract a sequence directly without loading the whole record into memory.
This is useful for huge sequences like chromosomes
```jldoctest
julia> extract(reader, "seq1", 3:5)
"GAA"
```

```@meta
DocTestSetup = nothing
```

FASTX.jl does not yet support indexing FASTQ files.

### Reference:
```@docs
faidx
seekrecord
extract
Index
```
59 changes: 59 additions & 0 deletions docs/src/fasta.md
@@ -0,0 +1,59 @@
```@meta
CurrentModule = FASTX
DocTestSetup = quote
using FASTX
end
```

# FASTA formatted files
__NB: First read the overview in the sidebar__

FASTA is a text-based file format for representing biological sequences.
A FASTA file stores a list of sequence records with name, description, and
sequence.

The template of a sequence record is:

```
>{description}
{sequence}
```

Here, the "identifier" is the first part of the description up to the first whitespace
(or the entire description if there is no whitespace).

Here is an example of a chromosomal sequence:

```
>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
```

## The `FASTARecord`
FASTA records are, by design, very lax in what they can contain.
They can contain almost arbitrary byte sequences, including invalid unicode, and trailing whitespace on their sequence lines, which will be interpreted as part of the sequence.
If you want to have more certainty about the format, you can either check the content of the sequences with a regex, or (preferably), convert them to the desired `BioSequence` type.

```@docs
FASTA.Record
```

### Reference:
```@docs
identifier
description
sequence
```

## `FASTAReader` and `FASTAWriter`
`FASTAWriter` can optionally be passed the keyword `width` to control the line width.
If this is zero or negative, it will write all record sequences on a single line.
Else, it will wrap lines to the given maximal width.
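
The following sketch shows one way the `width` keyword might be used, assuming FASTX v2. The record content and the width of 5 are made up for the example, and the writer is flushed before the output is read because writers are buffered.

```julia
using FASTX

record = parse(FASTARecord, ">chrI chromosome 1\nCCACACCACACC")

io = IOBuffer()
writer = FASTAWriter(io; width=5)
write(writer, record)
flush(writer)   # push buffered output to `io` without closing it

text = String(take!(io))
lines = split(text, '\n'; keepempty=false)

# The first line is the header; the rest are sequence lines of at most 5 characters.
header = lines[1]
seqlines = lines[2:end]
```

Passing `width=0` instead should, per the description above, write each sequence on a single line.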

### Reference:
```@docs
FASTA.Reader
FASTA.Writer
validate_fasta
```