Thomas Roder edited this page Sep 19, 2024 · 11 revisions

Required inputs

Traits

Two different file formats are supported: binary and numeric. The parameter --trait-data-type tells Scoary2 how to read the traits file.

The file may contain missing values, indicated by one of the following symbols: NA, NaN, -, . or empty string.

binary traits:

Trait trait-1 trait-2 trait-3
isolate-1 1 1 0
isolate-2 1 0 1
isolate-3 0 0 0

Set --trait-data-type to binary:<delimiter>, e.g. binary:\t.
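As a rough illustration of the format above, the following Python sketch reads a tab-delimited binary traits table and maps the listed missing-value symbols to None. The function name read_binary_traits and the example data are made up for this page; this is not Scoary2's own loader.

```python
import csv
import io

# Symbols the text above lists as missing values (including the empty string).
MISSING = {"NA", "NaN", "-", ".", ""}


def read_binary_traits(handle, delimiter="\t"):
    """Return {isolate: {trait: True/False/None}}; None marks a missing value."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    trait_names = header[1:]  # the first column holds the isolate names
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = {}
        for trait, cell in zip(trait_names, row[1:]):
            cell = cell.strip()
            if cell in MISSING:
                values[trait] = None
            elif cell in {"0", "1"}:
                values[trait] = cell == "1"
            else:
                raise ValueError(f"not a binary value: {cell!r}")
        table[row[0]] = values
    return table


# Illustrative data only: the table from above with two values made missing.
example = (
    "Trait\ttrait-1\ttrait-2\ttrait-3\n"
    "isolate-1\t1\t1\t0\n"
    "isolate-2\t1\tNA\t1\n"
    "isolate-3\t0\t0\t\n"
)
traits = read_binary_traits(io.StringIO(example))
```

Here the delimiter argument plays the role of the part after the colon in binary:\t.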

numeric traits:

Trait trait-1 trait-2 trait-3
isolate-1 5.3 7.4 1.1
isolate-2 4.2 2.1 7.3
isolate-3 1.4 5.6 5.3

For example, set --trait-data-type to gaussian:kmeans:\t to:

  1. attempt splitting each trait with GaussianMixture
  2. if it fails, split the trait using KMeans
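
To give an intuition for what splitting a numeric trait means, here is a minimal pure-Python sketch of a two-cluster k-means split in one dimension. It is a simplified stand-in written for this page (the function name two_means_split is hypothetical), not the code Scoary2 runs, and it does not cover the GaussianMixture step.

```python
def two_means_split(values, max_iter=100):
    """Split 1-D values into a low (False) and a high (True) class using 2-means."""
    lo, hi = min(values), max(values)  # initial cluster centres
    if lo == hi:
        raise ValueError("trait has no variation and cannot be split")
    labels = []
    for _ in range(max_iter):
        # Assign each value to the nearer of the two centres (ties go to the low class).
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        low_group = [v for v, is_high in zip(values, labels) if not is_high]
        high_group = [v for v, is_high in zip(values, labels) if is_high]
        new_lo = sum(low_group) / len(low_group)
        new_hi = sum(high_group) / len(high_group)
        if (new_lo, new_hi) == (lo, hi):
            break  # the centres stopped moving
        lo, hi = new_lo, new_hi
    return labels


# trait-1 from the numeric example above, in isolate order
split = two_means_split([5.3, 4.2, 1.4])
```

For trait-1 this puts isolate-1 and isolate-2 in the high class and isolate-3 in the low class.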
How --trait-data-type works:

Set --trait-data-type to <method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>:

parameter          possible values               default
method             binary, gaussian, kmeans      binary
delimiter          any single character          ,
cutoff¹            0.5 <= cutoff < 1             0.85
covariance_type¹   tied, full, diag, spherical   tied
alternative¹       skip, kmeans                  skip

¹: Only relevant if method is gaussian. Meaning:

  • cutoff: Determines how confident the Gaussian mixture model must be to classify an isolate. Must be a number between 0.5 and 1. Default: 0.85.
  • covariance_type: See GaussianMixture documentation.
  • alternative: What to do if the trait cannot be split with GaussianMixture. Possible values: skip or kmeans. Default: skip.
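
The following Python sketch shows one way such a string could be decoded into its parameters. It is a hypothetical parser written for illustration (the function name and the rule of recognising optional fields by their value are assumptions), not Scoary2's actual implementation. It assumes that the delimiter, if given, is the last field and is a single character after decoding escapes such as \t.

```python
import codecs

METHODS = {"binary", "gaussian", "kmeans"}
COVARIANCE_TYPES = {"tied", "full", "diag", "spherical"}
ALTERNATIVES = {"skip", "kmeans"}


def parse_trait_data_type(spec):
    """Hypothetical parser for <method>:<?cutoff>:<?covariance_type>:<?alternative>:<?delimiter>."""
    # Defaults as listed in the parameter table above.
    params = {
        "method": "binary",
        "cutoff": 0.85,
        "covariance_type": "tied",
        "alternative": "skip",
        "delimiter": ",",
    }
    method, *rest = spec.split(":")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    params["method"] = method
    if rest:
        # On the command line, \t arrives as backslash + t, so decode escapes first.
        last = codecs.decode(rest[-1], "unicode_escape")
        if len(last) == 1:  # assumption: a single character in last position is the delimiter
            params["delimiter"] = last
            rest = rest[:-1]
    for field in rest:
        if field in COVARIANCE_TYPES:
            params["covariance_type"] = field
        elif field in ALTERNATIVES:
            params["alternative"] = field
        else:
            cutoff = float(field)  # raises ValueError for unrecognised words
            if not 0.5 <= cutoff < 1:
                raise ValueError(f"cutoff out of range: {cutoff}")
            params["cutoff"] = cutoff
    return params


params = parse_trait_data_type("gaussian:kmeans:\\t")
```

With this sketch, gaussian:kmeans:\t yields the gaussian method, the kmeans alternative, a tab delimiter, and the defaults for cutoff and covariance_type.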

Genes

Two different file formats are supported: gene-count and gene-list. The parameter --gene-data-type tells Scoary2 how to read the genes file, for example:

  • --gene-data-type gene-count:,: the genes-file is of type gene-count and the delimiter is , (Roary)
  • --gene-data-type gene-list:\t: the genes-file is of type gene-list and the delimiter is \t (OrthoFinder)

This file must not contain missing values.

gene-count:

Gene isolate-1 isolate-2 isolate-3
gene-1 0 1 1
gene-2 2 1 0
gene-3 1 2 0

Note: Only columns whose name corresponds to an isolate are kept. Output from Roary can therefore be used directly, but the Annotation column is ignored. (To add this information, see gene-info.)
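
The column filtering described in the note can be sketched in Python as follows. The function name read_gene_count, the isolates argument (the set of known isolate names) and the Annotation values in the example are assumptions made for illustration; this is not Scoary2's own loader.

```python
import csv
import io


def read_gene_count(handle, isolates, delimiter=","):
    """Return {gene: {isolate: count}}, keeping only columns named after an isolate."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    gene_column = reader.fieldnames[0]  # the first column holds the gene names
    kept = [name for name in reader.fieldnames[1:] if name in isolates]
    return {
        row[gene_column]: {name: int(row[name]) for name in kept}
        for row in reader
    }


# Illustrative data: the table above plus a made-up Annotation column.
example = (
    "Gene,Annotation,isolate-1,isolate-2,isolate-3\n"
    "gene-1,hypothetical protein,0,1,1\n"
    "gene-2,hypothetical protein,2,1,0\n"
    "gene-3,hypothetical protein,1,2,0\n"
)
counts = read_gene_count(
    io.StringIO(example), isolates={"isolate-1", "isolate-2", "isolate-3"}
)
```

The Annotation column is dropped because its name does not match any isolate.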

gene-list:

This file format contains the gene names, which can be useful when working with the data (see Output).

Gene    isolate-1                      isolate-2                      isolate-3
gene-1                                 isolate-2_0123                 isolate-3_3252
gene-2  isolate-1_4323,isolate-1_0935  isolate-2_0456
gene-3  isolate-1_1271                 isolate-2_0005,isolate-2_0902
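
A minimal Python sketch of reading this format: each cell is split on commas into a list of gene names, and an empty cell becomes an empty list. The function name read_gene_list is made up for this page, and the sketch is not Scoary2's own loader.

```python
import csv
import io


def read_gene_list(handle, delimiter="\t"):
    """Return {gene: {isolate: [gene names]}}; an empty cell gives an empty list."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    isolates = header[1:]  # the first column holds the orthogene names
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad rows whose trailing empty cells were dropped.
        cells = row[1:] + [""] * (len(isolates) - len(row) + 1)
        table[row[0]] = {
            isolate: [name.strip() for name in cell.split(",") if name.strip()]
            for isolate, cell in zip(isolates, cells)
        }
    return table


# The example table from above, tab-delimited.
example = (
    "Gene\tisolate-1\tisolate-2\tisolate-3\n"
    "gene-1\t\tisolate-2_0123\tisolate-3_3252\n"
    "gene-2\tisolate-1_4323,isolate-1_0935\tisolate-2_0456\t\n"
    "gene-3\tisolate-1_1271\tisolate-2_0005,isolate-2_0902\n"
)
genes = read_gene_list(io.StringIO(example))
```

Taking the length of each list gives the same information as a gene-count file.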

Optional inputs

trait-info

A tab-separated file that describes traits, for example:

Trait Description
trait_1 resistance to tetracycline
trait_2 facilitates gut adherence
trait_3 digests tetracycline

The table may contain multiple columns.

gene-info

A tab-separated file that describes orthogenes, for example:

Gene Annotation
N0.HOG0000000 amino acid ABC transporter
N0.HOG0000001 IS30 family transposase
N0.HOG0000002 IS5/IS1182 family transposase

The table may contain multiple columns.

Such a file can easily be generated from OrthoFinder output; see the Tutorial.

isolate-info

A tab-separated file that describes isolates, for example:

Isolate Species
isolate-1 Streptococcus thermophilus
isolate-2 Lentilactobacillus parafarraginis
isolate-3 Streptococcus thermophilus

The table may contain multiple columns.
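
All three info files share the same layout: one identifier column followed by one or more descriptive columns. As a hedged illustration, the Python sketch below loads such a tab-separated table into a dictionary keyed by the first column. The function name read_info and the extra Origin column in the example are made up for this page.

```python
import csv
import io


def read_info(handle):
    """Return {identifier: {column: value}} from a tab-separated info table."""
    reader = csv.DictReader(handle, delimiter="\t")
    key = reader.fieldnames[0]  # first column: Trait, Gene or Isolate
    return {
        row[key]: {column: value for column, value in row.items() if column != key}
        for row in reader
    }


# Illustrative isolate-info table with a second, made-up column.
example = (
    "Isolate\tSpecies\tOrigin\n"
    "isolate-1\tStreptococcus thermophilus\tcheese\n"
    "isolate-2\tLentilactobacillus parafarraginis\tcheese\n"
)
info = read_info(io.StringIO(example))
```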