isONclust2 - a tool for de novo clustering of long transcriptomic reads

isONclust2 is a tool for clustering long transcriptomic reads into gene families. The tool is based on the approach pioneered by isONclust, using minimizers and occasional pairwise alignment.

isONclust2 is implemented in C++, which makes it fast enough to cluster large transcriptomic datasets produced on PromethION P24 and P48 devices. The tool is not a re-implementation of the original isONclust approach, as it deals with the strandedness of the reads and provides further optional features.

WARNING: In order to be able to handle large datasets, isONclust2 splits the input data into batches which have to be processed in a specified order to obtain the results. Hence, the use of isONclust2 as a standalone tool is highly discouraged and one should always use it through the de novo transcriptomics pipeline at https://github.com/nanoporetech/pipeline-nanopore-denovo-isoforms.

Getting Started

Installation

The best way to install isONclust2 is from bioconda:

Make sure you have miniconda3 installed.
Install the tool by issuing conda install -c bioconda isonclust2

Compiling from source

Clone the repository: git clone --recursive https://github.com/nanoporetech/isONclust2.git
Make sure you have cmake v3.1 or later.
Issue cd isONclust2; mkdir build; cd build; cmake ..; make -j
The produced binary is static with no library dependencies. Link this under your path to use the tool.

Usage

Help message

isONclust2 version: v2.3-a0e5b32
Available subcommands: sort, cluster, dump, info, help, version

sort - sort reads and write out batches:
        -B --batch-size        Batch size in kilobases (default: 50000)
        -M --batch-max-seq     Maximum number of sequences per batch (default: 3000).
        -k --kmer-size         Kmer size (default: 11).
        -w --window-size       Window size (default: 15).
        -m --min-shared        Minimum number of minimizers shared between read and cluster (default: 5).
        -q --min-qual          Minimum average quality value (default: 7.0).
        -x --mode  Clustering mode:
                   * sahlin (default): use minimizers first, alignment second
                   * fast: use minimizers only
                   * furious: always use alignment
        -g --low-cons-size     Use all sequences for consensus below this size (default: 20).
        -c --max-cons-size     Maximum number of sequences used for consensus (default: 150).
        -P --cons-period       Do not recalculate consensus after this many seuqences added (default: 500).
        -r --mapped-threshold  Minmum mapped fraction of read to be     included in cluster (default: 0.65).
        -a --aligned-threshold Minimum aligned fraction of read to be included in cluster (default: 0.2).
        -f --min-fraction      Minimum fraction of minimizers shared compared to best hit, in order to continue mapping (default: 0.8).
        -p --min-prob-no-hits  Minimum probability for i consecutive    minimizers to be different between read and representative (default: 0.1)
        -F --min-cls-size      Skip clusters smaller than this in the left batch (default: 3).
        -o --outfolder         Output folder (default:  ./isONclust2_batches).
        -h --help              Print help.
        -v --verbose           Verbose output.
        -d --debug             Print debug info.
        [positional argument]  Input fastq file (required).

cluster - cluster and/or merge batches:
        -l --left-batch        Left input batch (mandatory).
        -r --right-batch       Right input batch (optional).
        -o --outfile           Output batch.
        -x --mode  Clustering mode:
                   * sahlin (default): use minimizers first, alignment second
                   * fast: use minimizers only
                   * furious: use alignment only
        -A --spoa-algo  spoa alignment algorithm:
                   * 0 (default): local
                   * 1 : global
                   * 1 : semi-global
        -z --min-purge         Purge minimizer database from output batch.
        -j --keep-seq          Do not purge non-representative sequences from output batches.
        -F --min-cls-size      Skip clusters smaller than this in the left batch.
        -v --verbose           Verbose output.
        -Q --quiet             Supress progress bar.
        -d --debug             Print debug info.
        -h --help              Print help.

dump - dump clustered batch:
        -o --outdir            Output directory.
        -i --index             Index of sorted reads.
        -v --verbose           Verbose output.
        -d --debug             Print debug info.
        -h --help              Print help.

info:
        -h --help              Print help.
        [positional argument]  Input serialized batch (required).

help - print help message

version - print version

A minimal example

# sort reads and write out batches:
isONclust2 sort -B 50000 -v ens500.fq
# initial clustering of individual batches:
isONclust2 cluster -v -l isONclust2_batches/sorted/batches/isONbatch_0.cer -o b0.cer
isONclust2 cluster -v -l isONclust2_batches/sorted/batches/isONbatch_1.cer -o b1.cer
isONclust2 cluster -v -l isONclust2_batches/sorted/batches/isONbatch_2.cer -o b1.cer
# merge cluster batches:
isONclust2 cluster -v -l b0.cer -r b1.cer -o b_0_1.cer
isONclust2 cluster -v -l b_0_1.cer -r b2.cer -o b_0_1_2.cer
# dump final results:
isONclust2 dump -v -i sorted/sorted_reads_idx.cer -o results b_0_1_2.cer

Help

Acknowledgements

This software was built in collaboration with Kristoffer Sahlin and Paul Medvedev.

Licence and Copyright

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

FAQs and tips

References and Supporting Information

See the post announcing the transcriptomics tools at the Nanopore Community here.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
.circleci		.circleci
docs		docs
src		src
test		test
vendor		vendor
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
COPYRIGHT		COPYRIGHT
Doxyfile		Doxyfile
LICENSE.md		LICENSE.md
Makefile		Makefile
ONT_logo.png		ONT_logo.png
README.md		README.md

License

nanoporetech/isONclust2

Folders and files

Latest commit

History

Repository files navigation

isONclust2 - a tool for de novo clustering of long transcriptomic reads

Getting Started

Installation

Compiling from source

Usage

Help message

A minimal example

Help

Acknowledgements

Licence and Copyright

FAQs and tips

References and Supporting Information

Research Release

About

Topics

Resources

License

Stars

Watchers

Forks

Languages