Input
Two different file formats are supported for the traits file: `binary` and `numeric`.
The parameter `--trait-data-type` tells Scoary2 how to read the traits file.
The file may contain missing values, indicated by one of the following symbols: `NA`, `NaN`, `-`, `.` or an empty string.
binary traits:
Trait | trait-1 | trait-2 | trait-3 |
---|---|---|---|
isolate-1 | 1 | 1 | 0 |
isolate-2 | 1 | 0 | 1 |
isolate-3 | 0 | 0 | 0 |
Set `--trait-data-type` to `binary:<delimiter>`, e.g. `binary:\t`.
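As an illustration of the binary format and the missing-value symbols listed above, here is a minimal sketch that reads such a table with Python's standard `csv` module. The function name `read_binary_traits` and the returned dictionary layout are assumptions made for this example; this is not Scoary2's own loader.

```python
import csv
import io

# Missing-value symbols as listed in the text above.
MISSING = {"NA", "NaN", "-", ".", ""}


def read_binary_traits(handle, delimiter="\t"):
    """Return {trait: {isolate: True/False/None}}; None marks a missing value.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    trait_names = header[1:]  # the first column holds the isolate names
    traits = {name: {} for name in trait_names}
    for row in reader:
        if not row:
            continue  # skip blank lines
        isolate = row[0]
        # Pad short rows so that trailing empty cells count as missing.
        cells = row[1:] + [""] * (len(trait_names) - len(row) + 1)
        for name, cell in zip(trait_names, cells):
            cell = cell.strip()
            if cell in MISSING:
                traits[name][isolate] = None
            elif cell in ("0", "1"):
                traits[name][isolate] = cell == "1"
            else:
                raise ValueError(f"{isolate}/{name}: unexpected value {cell!r}")
    return traits


example = (
    "Trait\ttrait-1\ttrait-2\ttrait-3\n"
    "isolate-1\t1\t1\t0\n"
    "isolate-2\t1\tNA\t1\n"
    "isolate-3\t0\t0\t\n"
)
traits = read_binary_traits(io.StringIO(example))
```

Missing cells are kept as `None` rather than dropped, so every trait still has one entry per isolate.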
numeric traits:
Trait | trait-1 | trait-2 | trait-3 |
---|---|---|---|
isolate-1 | 5.3 | 7.4 | 1.1 |
isolate-2 | 4.2 | 2.1 | 7.3 |
isolate-3 | 1.4 | 5.6 | 5.3 |
For example, set `--trait-data-type` to `gaussian:kmeans:\t` to:
- attempt splitting each trait with GaussianMixture
- if it fails, split the trait using KMeans
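To give an idea of what "splitting" a numeric trait into two classes means, the sketch below implements the k-means step (k = 2) for one-dimensional data in plain Python. It is a stand-in for illustration only: the function `two_means_split` is hypothetical, and Scoary2 presumably relies on a library implementation of KMeans rather than code like this.

```python
def two_means_split(values, max_iter=100):
    """Split 1-D values into a low (0) and a high (1) class with k-means, k=2.

    Hypothetical helper: the centres start at the minimum and maximum value.
    """
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot split a constant trait")
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to the nearer centre.
        new_labels = [int(abs(v - high) < abs(v - low)) for v in values]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        lows = [v for v, label in zip(values, labels) if label == 0]
        highs = [v for v, label in zip(values, labels) if label == 1]
        low = sum(lows) / len(lows)
        high = sum(highs) / len(highs)
    return labels


# trait-1 and trait-2 from the numeric example table (isolate-1..3)
trait_1 = two_means_split([5.3, 4.2, 1.4])
trait_2 = two_means_split([7.4, 2.1, 5.6])
```

With these inputs, `trait_1` becomes `[1, 1, 0]` and `trait_2` becomes `[1, 0, 1]`: every isolate gets a class, whereas a mixture model with a confidence cutoff can leave uncertain isolates unclassified.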
How `--trait-data-type` works:
Set `--trait-data-type` to `<method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>`, where fields marked with `?` are optional:
parameter | possible values | default |
---|---|---|
method | `binary`, `gaussian`, `kmeans` | `binary` |
delimiter | any single character | `,` |
cutoff ¹ | 0.5 <= cutoff < 1 | 0.85 |
covariance_type ¹ | `tied`, `full`, `diag`, `spherical` | `tied` |
alternative ¹ | `skip`, `kmeans` | `skip` |
¹: Only relevant if method is `gaussian`. Meaning:
- `cutoff`: Determines how confident the Gaussian mixture model must be to classify an isolate. Must be a number of at least 0.5 and below 1. Default: 0.85.
- `covariance_type`: See GaussianMixture documentation.
- `alternative`: What to do if the trait cannot be split with GaussianMixture. Possible values: `skip` or `kmeans`. Default: `skip`.
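The following sketch shows one way a `--trait-data-type` string could be decomposed into the parameters from the table, falling back to the documented defaults. How Scoary2 itself resolves the optional fields is not described here, so the approach below (recognising each field by its value, and treating an unrecognised last field as the delimiter) is an assumption, as are the names `TraitDataType` and `parse_trait_data_type`.

```python
from dataclasses import dataclass

METHODS = {"binary", "gaussian", "kmeans"}
COVARIANCE_TYPES = {"tied", "full", "diag", "spherical"}
ALTERNATIVES = {"skip", "kmeans"}


@dataclass
class TraitDataType:
    # Defaults taken from the parameter table.
    method: str = "binary"
    cutoff: float = 0.85
    covariance_type: str = "tied"
    alternative: str = "skip"
    delimiter: str = ","


def parse_trait_data_type(spec):
    """Hypothetical parser; not Scoary2's actual implementation."""
    fields = spec.split(":")
    if fields[0] not in METHODS:
        raise ValueError(f"unknown method: {fields[0]!r}")
    parsed = TraitDataType(method=fields[0])
    rest = fields[1:]
    for position, field in enumerate(rest):
        if field in COVARIANCE_TYPES:
            parsed.covariance_type = field
            continue
        if field in ALTERNATIVES:
            parsed.alternative = field
            continue
        try:
            number = float(field)
        except ValueError:
            number = None
        if number is not None and 0.5 <= number < 1:
            parsed.cutoff = number
            continue
        if position == len(rest) - 1:
            # Assumption: a literal backslash-t from the shell means a tab.
            delimiter = "\t" if field == "\\t" else field
            if len(delimiter) == 1:
                parsed.delimiter = delimiter
                continue
        raise ValueError(f"cannot interpret field {field!r}")
    return parsed


spec = parse_trait_data_type("gaussian:kmeans:\\t")
```

For the example string above, `spec` has method `gaussian`, alternative `kmeans` and a tab delimiter, while `cutoff` and `covariance_type` keep their defaults. A delimiter of `:` cannot be expressed with this simple split.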
Two different file formats are supported for the genes file: `gene-count` and `gene-list`.
The parameter `--gene-data-type` tells Scoary2 how to read the genes file, for example:
- `--gene-data-type gene-count:,`: the genes file is of type `gene-count` and the delimiter is `,` (Roary)
- `--gene-data-type gene-list:\t`: the genes file is of type `gene-list` and the delimiter is `\t` (OrthoFinder)
This file must not contain missing values.
gene-count:
Gene | isolate-1 | isolate-2 | isolate-3 |
---|---|---|---|
gene-1 | 0 | 1 | 1 |
gene-2 | 2 | 1 | 0 |
gene-3 | 1 | 2 | 0 |
Note: Only columns whose name corresponds to an isolate are kept. In other words, output from Roary can be used directly, but the `Annotation` column is ignored. (To add this information, see gene-info.)
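The column filtering described in the note can be sketched as follows. The function `read_gene_count` is hypothetical, and converting counts to presence/absence (count > 0) is an assumption made for this example rather than a documented Scoary2 behaviour.

```python
import csv
import io


def read_gene_count(handle, isolates, delimiter=","):
    """Return {gene: {isolate: present?}}, keeping only known isolate columns.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    # Keep (index, name) for every column that is named after an isolate;
    # anything else, such as "Annotation", is ignored.
    keep = [(i, name) for i, name in enumerate(header) if i > 0 and name in isolates]
    presence = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        presence[row[0]] = {name: int(row[i]) > 0 for i, name in keep}
    return presence


example = (
    "Gene,Annotation,isolate-1,isolate-2,isolate-3\n"
    "gene-1,hypothetical protein,0,1,1\n"
    "gene-2,transposase,2,1,0\n"
    "gene-3,ABC transporter,1,2,0\n"
)
known_isolates = {"isolate-1", "isolate-2", "isolate-3"}
presence = read_gene_count(io.StringIO(example), known_isolates)
```

Because missing values are not allowed in this file, `int()` is applied to every kept cell without a fallback, so a malformed cell raises an error instead of being silently skipped.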
gene-list:
This file format contains the gene names, which may be useful when working with the data (see Output).
Gene | isolate-1 | isolate-2 | isolate-3 |
---|---|---|---|
gene-1 | | isolate-2_0123 | isolate-3_3252 |
gene-2 | isolate-1_4323,isolate-1_0935 | isolate-2_0456 | |
gene-3 | isolate-1_1271 | isolate-2_0005,isolate-2_0902 | |
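A `gene-list` table can be turned into per-isolate gene lists, and from there into presence/absence, roughly as sketched below. The helper `read_gene_list` is hypothetical; it assumes that an empty cell means the gene is absent and that several genes in one cell are separated by commas, as in the example table.

```python
import csv
import io


def read_gene_list(handle, delimiter="\t"):
    """Return {gene: {isolate: [gene names]}}; an empty list means absent.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    isolates = header[1:]  # the first column holds the orthogene name
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so that trailing empty cells are treated as absent.
        cells = row[1:] + [""] * (len(isolates) - len(row) + 1)
        table[row[0]] = {
            isolate: [name.strip() for name in cell.split(",") if name.strip()]
            for isolate, cell in zip(isolates, cells)
        }
    return table


example = (
    "Gene\tisolate-1\tisolate-2\tisolate-3\n"
    "gene-1\t\tisolate-2_0123\tisolate-3_3252\n"
    "gene-2\tisolate-1_4323,isolate-1_0935\tisolate-2_0456\t\n"
    "gene-3\tisolate-1_1271\tisolate-2_0005,isolate-2_0902\t\n"
)
gene_lists = read_gene_list(io.StringIO(example))
# Presence/absence follows from whether the list is non-empty.
presence = {
    gene: {isolate: bool(names) for isolate, names in row.items()}
    for gene, row in gene_lists.items()
}
```

Keeping the gene names, rather than only counts, makes it possible to trace an orthogene back to the individual genes in each isolate.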
A tab-separated file that describes traits, for example:
Trait | Description |
---|---|
trait_1 | resistance to tetracycline |
trait_2 | facilitates gut adherence |
trait_3 | digests tetracycline |
The table may contain multiple columns.
A tab-separated file that describes orthogenes, for example:
Gene | Annotation |
---|---|
N0.HOG0000000 | amino acid ABC transporter |
N0.HOG0000001 | IS30 family transposase |
N0.HOG0000002 | IS5/IS1182 family transposase |
The table may contain multiple columns.
Such a file can easily be generated from OrthoFinder, see Tutorial.
A tab-separated file that describes isolates, for example:
Isolate | Species |
---|---|
isolate-1 | Streptococcus thermophilus |
isolate-2 | Lentilactobacillus parafarraginis |
isolate-3 | Streptococcus thermophilus |
The table may contain multiple columns.
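The three description tables (traits, orthogenes, isolates) share the same layout: a key column followed by any number of extra columns. A minimal sketch for loading such a tab-separated file into a dictionary is shown below; `read_info_table` is a hypothetical helper name, not part of Scoary2.

```python
import csv
import io


def read_info_table(handle):
    """Return {key: {column: value}}, keyed by the first column.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    key = reader.fieldnames[0]  # e.g. "Trait", "Gene" or "Isolate"
    return {
        row[key]: {column: value for column, value in row.items() if column != key}
        for row in reader
    }


example = (
    "Isolate\tSpecies\n"
    "isolate-1\tStreptococcus thermophilus\n"
    "isolate-2\tLentilactobacillus parafarraginis\n"
    "isolate-3\tStreptococcus thermophilus\n"
)
isolate_info = read_info_table(io.StringIO(example))
```

Additional columns simply show up as extra keys in each inner dictionary.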