# Graboid documentation

The command line execution of the graboid software is divided into five modes: *Database*, *Mapper*, *Calibrate*, *Report* and *Classify*. Within the designated working directory, the program uses the subdirectories *results*, *calibration*, *data*, *tmp* and *warnings*. A dictionary listing the generated files is pickled as a *catalog.pickle* file.
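
As a minimal sketch of how a pickled catalog can be written and read back, the snippet below stores a dictionary as *catalog.pickle* and restores it. The location of the file at the root of the working directory, the helper names and the dictionary keys are assumptions made for illustration; the actual catalog contents are defined by graboid itself.

```python
import pickle
import tempfile
from pathlib import Path


def save_catalog(work_dir, catalog):
    """Pickle the catalog dictionary inside the working directory."""
    path = Path(work_dir) / "catalog.pickle"
    with open(path, "wb") as handle:
        pickle.dump(catalog, handle)
    return path


def load_catalog(work_dir):
    """Read the pickled catalog back into a dictionary."""
    with open(Path(work_dir) / "catalog.pickle", "rb") as handle:
        return pickle.load(handle)


# Hypothetical catalog entries, used only to demonstrate the round trip.
example_catalog = {"fasta": "data/reference.fasta", "tax": "data/reference.tax"}

with tempfile.TemporaryDirectory() as work_dir:
    save_catalog(work_dir, example_catalog)
    restored = load_catalog(work_dir)
```

Only unpickle catalogs from working directories you trust, since loading a pickle file can execute arbitrary code.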

## Database
### Usage
**graboid DATABASE \-\-work_dir \-\-taxon \-\-marker \-\-fasta \-\-bold \-\-ranks \-\-chunksize \-\-max_attempts \-\-mv \-\-keep_tmp**

### Description
In this step, the program builds the reference database to be used in the classification step. The reference database is made up of a fasta file and a taxonomic table containing the taxon of each sequence for a specified set of taxonomic ranks. The sequence file may be provided by the user as a pre-generated fasta file (**\-\-fasta**) or built using a cross search of the genomic databases for a given taxon/marker pair (**\-\-taxon**, **\-\-marker**).

### Arguments
**\-\-work_dir** \<path to working directory\>

>Working directory to contain the generated files. The value provided must remain constant for all steps that use the generated database.

**-T**, **\-\-taxon** \<taxon\>

>Taxon to be used in the cross search. Provided value should belong to a sufficiently high rank (phylum/class).

**-M**, **\-\-marker** \<marker\>

>Genetic marker or gene to be used in the cross search.

**-F**, **\-\-fasta** \<path to fasta file\>

>Fasta file containing the sequences to be included in the reference database. Overrides **--taxon** and **--marker** if a value is provided.

**\-\-bold**

>Toggles usage of the BOLD database. If included, the cross search includes the BOLD database in addition to GenBank.

**-r**, **\-\-ranks** \<rank1\> \<rank2\> ...

>Ranks to be included in the retrieved taxonomic table. Default values are *phylum*, *class*, *order*, *family*, *genus* and *species*.

**-c**, **\-\-chunksize** = 500

>The set of sequences to be downloaded is partitioned into segments of *chunksize* elements. Default value is 500.

**-m**, **\-\-max_attempts** = 3

>Maximum number of failed attempts at retrieving a sequence chunk before writing it off as a failure. Default value is 3. The program then attempts a second download run to retrieve the failed sequences. Sequences that still fail to download after this are listed in a warning file.
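
The interplay of **--chunksize** and **--max_attempts** can be sketched as follows. This is an illustration under stated assumptions, not graboid's actual download code: `fetch` is a hypothetical callable that returns the records for a chunk or raises `IOError`, and the function names are invented for the example.

```python
def chunked(items, chunksize=500):
    """Yield consecutive segments of at most `chunksize` elements."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


def download_all(accessions, fetch, chunksize=500, max_attempts=3):
    """Try each chunk up to `max_attempts` times; collect what keeps failing."""
    records, failed = [], []
    for chunk in chunked(accessions, chunksize):
        for _ in range(max_attempts):
            try:
                records.extend(fetch(chunk))
                break
            except IOError:
                continue
        else:
            # Every attempt failed: write the chunk off for this run.
            failed.extend(chunk)
    return records, failed


def download_with_second_pass(accessions, fetch, chunksize=500, max_attempts=3):
    """Run the download, then retry the failed sequences once more."""
    records, failed = download_all(accessions, fetch, chunksize, max_attempts)
    if failed:
        recovered, failed = download_all(failed, fetch, chunksize, max_attempts)
        records.extend(recovered)
    return records, failed  # `failed` would end up in the warning file


# Simulated fetch that always fails for chunks containing "bad".
calls = {"count": 0}


def flaky_fetch(chunk):
    calls["count"] += 1
    if "bad" in chunk:
        raise IOError("simulated download failure")
    return [accession.upper() for accession in chunk]


records, failed = download_with_second_pass(
    ["a", "b", "bad", "c"], flaky_fetch, chunksize=2, max_attempts=3
)
```

Note that in this sketch a whole chunk is marked as failed when any of its downloads fail, so a smaller chunk size limits how many sequences a single failure can take down with it.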

**\-\-mv**

>Toggles move of the provided fasta file. If enabled, the file provided as **--fasta** is moved to the working directory instead of copied.

**\-\-keep_tmp**

>Toggles deletion of temporary files. If provided, the generated temporary files are kept; otherwise, the temporary directory is wiped.

## Mapper
### Usage
**graboid MAPPER \-\-work_dir \-\-base_seq \-\-db_dir \-\-out_name \-\-ref_name \-\-evalue \-\-threads**

### Description
Align the sequences retrieved in the *DATABASE* step against a baseline sequence of the specified marker and process the result into a numeric matrix. Additionally, quantify the information present in each site and generate an order file containing the sites for each taxon, sorted by descending information content.

### Arguments
**\-\-work_dir** \<path to working directory\>

>Working directory to contain the generated files. The value provided must remain constant for all steps that use the generated database.

**-B**, **\-\-base_seq** \<base sequence file\>

>Marker sequence to be used as base of the sequence alignment.

**-db**, **\-\-db_dir** \<path to blast database directory\>

>OPTIONAL. If a blast database of the baseline sequence is already available, it may be provided in place of **--base_seq**.

**-o**, **\-\-out_name** \<output file name\>

>OPTIONAL. Alternative name for the generated files.

**-bn**, **\-\-blast_name** \<blast report file name\>

>OPTIONAL. Alternative name for the blast report.

**-e**, **\-\-evalue** = 0.005

>E-value threshold for the blast alignment. Default value is 0.005.

**-t**, **\-\-threads** = 1

>Number of threads to use in the blast alignment. Default value is 1.

## Calibrate
### Usage
**graboid CALIBRATE \-\-work_dir \-\-row_thresh \-\-col_thresh \-\-min_seqs \-\-rank \-\-dist_mat \-\-w_size \-\-w_step \-\-max_k \-\-step_k \-\-max_n \-\-step_n \-\-min_k \-\-min_n \-\-out_file**

### Description
Perform a grid search over combinations of *window*, *k* and *n* values, using the *leave-one-out* method to assess performance for each taxon. Generate a calibration report containing the *Accuracy*, *Precision*, *Recall* and *F1 score* for each taxon in each parameter combination.
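
The evaluation described above can be sketched in two parts: a leave-one-out loop that classifies each sequence against all the others, and the per-taxon metrics computed from the resulting predictions. The function names and the `classify` callable are hypothetical, and the metrics use the standard one-vs-rest definitions; the report generated by graboid may compute them differently.

```python
def leave_one_out(samples, labels, classify):
    """Predict each sample using every other sample as the reference set."""
    predictions = []
    for i, sample in enumerate(samples):
        reference = samples[:i] + samples[i + 1:]
        reference_labels = labels[:i] + labels[i + 1:]
        predictions.append(classify(sample, reference, reference_labels))
    return predictions


def taxon_metrics(true_labels, predicted_labels, taxon):
    """One-vs-rest accuracy, precision, recall and F1 score for one taxon."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == taxon and p == taxon for t, p in pairs)
    fp = sum(t != taxon and p == taxon for t, p in pairs)
    fn = sum(t == taxon and p != taxon for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def nearest_label(sample, reference, reference_labels):
    """Toy 1-nearest-neighbour classifier on plain numbers, for the demo only."""
    closest = min(range(len(reference)), key=lambda i: abs(reference[i] - sample))
    return reference_labels[closest]


loo_predictions = leave_one_out([1, 2, 10, 11], ["a", "a", "b", "b"], nearest_label)

apis = taxon_metrics(
    ["Apis", "Apis", "Bombus", "Bombus"],
    ["Apis", "Bombus", "Bombus", "Bombus"],
    "Apis",
)
```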

### Arguments
**\-\-work_dir** \<path to working directory\>

>Working directory to contain the generated files. The value provided must remain constant for all steps that use the generated database.

**-rt**, **\-\-row_thresh** = 0.2

>Empty row threshold. Percentage threshold of empty values to filter an alignment row. Default value is 0.2.

**-ct**, **\-\-col_thresh** = 0.2

>Empty column threshold. Percentage threshold of empty values to filter an alignment column. Default value is 0.2.
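
A minimal sketch of how the two thresholds could be applied to a numeric alignment matrix is shown below. It assumes that empty values are encoded as 0, that a row or column is dropped when its fraction of empty values exceeds the threshold, and that rows are filtered before columns; these details and the function name are illustrative assumptions rather than graboid's documented behaviour.

```python
def filter_matrix(matrix, row_thresh=0.2, col_thresh=0.2, empty=0):
    """Drop rows, then columns, whose fraction of empty values is too high."""
    if not matrix or not matrix[0]:
        return [], []
    n_cols = len(matrix[0])
    kept_rows = [
        row for row in matrix
        if sum(value == empty for value in row) / n_cols <= row_thresh
    ]
    if not kept_rows:
        return [], []
    kept_cols = [
        col for col in range(n_cols)
        if sum(row[col] == empty for row in kept_rows) / len(kept_rows) <= col_thresh
    ]
    filtered = [[row[col] for col in kept_cols] for row in kept_rows]
    return filtered, kept_cols


alignment = [
    [1, 2, 3, 4, 1],
    [1, 2, 0, 4, 1],
    [0, 0, 0, 4, 1],  # 60% empty: removed by the row filter
    [1, 2, 0, 4, 2],
]
filtered, kept_cols = filter_matrix(alignment)
```

Returning the kept column indices alongside the filtered matrix makes it possible to map filtered sites back to their coordinates in the original alignment.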

**-ms**, **\-\-min_seqs** = 10

>Minimum number of sequences needed for a taxon to be considered. Default value is 10.

**-rk**, **\-\-rank** = 'genus'

>Taxonomic rank used for attribute selection. Default value is *genus*.

**-dm**, **\-\-dist_mat** \<distance matrix\>

>Distance matrix to be used in the neighbour calculation. String *id* to use the identity matrix. String *s1v2* to use the transition = 1, transversion = 2 matrix. A custom matrix can be used by passing it as a file path.
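
The two built-in options can be pictured as substitution cost tables. The sketch below builds both and uses them to compute the distance between two aligned sequences. It assumes that *id* charges 1 for any mismatch and that *s1v2* charges 1 for transitions (A/G, C/T) and 2 for transversions; the names and the handling of gaps or ambiguous bases (not covered here) are assumptions.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def build_cost_matrix(transition=1, transversion=2):
    """Return a dict mapping base pairs to substitution costs."""
    matrix = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                cost = 0
            elif (a in PURINES) == (b in PURINES):
                cost = transition  # purine<->purine or pyrimidine<->pyrimidine
            else:
                cost = transversion
            matrix[(a, b)] = cost
    return matrix


ID_MATRIX = build_cost_matrix(transition=1, transversion=1)
S1V2_MATRIX = build_cost_matrix(transition=1, transversion=2)


def sequence_distance(seq_a, seq_b, matrix):
    """Sum the per-site costs of two aligned sequences of equal length."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


id_distance = sequence_distance("ACGT", "GCTT", ID_MATRIX)
s1v2_distance = sequence_distance("ACGT", "GCTT", S1V2_MATRIX)
```

In this example the sequences differ by one transition (A/G) and one transversion (G/T), so the identity costs give a distance of 2 while the *s1v2* costs give 3.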

**-wz**, **\-\-w_size** = 200

>Window size to be used in each iteration. Default value is 200.

**-ws**, **\-\-w_step** = 15

>Window displacement between iterations. Default value is 15.
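
The sliding window defined by **--w_size** and **--w_step** can be sketched as a list of start/end coordinates over the alignment. The function name, the half-open coordinates and the treatment of the alignment tail are assumptions; graboid may handle the final window differently.

```python
def window_coordinates(alignment_length, w_size=200, w_step=15):
    """Return (start, end) pairs for windows sliding along the alignment."""
    last_start = max(alignment_length - w_size, 0)
    starts = range(0, last_start + 1, w_step)
    return [(start, min(start + w_size, alignment_length)) for start in starts]


windows = window_coordinates(250)
short_alignment = window_coordinates(100)
```

With the default values, an alignment of 250 sites yields four windows starting at 0, 15, 30 and 45. In this sketch the last five sites are not covered by any window, which is one of the choices a real implementation has to make.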

**-mk**, **\-\-max_k**

>Maximum *k* value to be explored. Default value is 15.

**-sk**, **\-\-step_k**

>Increment in *k* value between iterations. Default value is 2.

**-mn**, **\-\-max_n**

>Maximum *n* value to be explored. Default value is 30.

**-sn**, **\-\-step_n**

>Increment in *n* value between iterations. Default value is 5.

**-nk**, **\-\-min_k**

>Starting *k* value. Default value is 1.

**-nn**, **\-\-min_n**

>Starting *n* value. Default value is 5.
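
For a single window, the *k* and *n* arguments above span a grid of parameter combinations. The sketch below enumerates that grid using the documented defaults, assuming that the maximum values are inclusive; the function name is hypothetical.

```python
from itertools import product


def parameter_grid(min_k=1, max_k=15, step_k=2, min_n=5, max_n=30, step_n=5):
    """List every (k, n) combination explored within one window."""
    k_values = range(min_k, max_k + 1, step_k)
    n_values = range(min_n, max_n + 1, step_n)
    return list(product(k_values, n_values))


grid = parameter_grid()
```

Under these assumptions the defaults give 8 values of *k* (1, 3, ..., 15) and 6 values of *n* (5, 10, ..., 30), that is, 48 combinations per window. Since this count is multiplied by the number of windows, larger steps are a simple way to shorten calibration.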

**-o**, **\-\-out_file**

>OPTIONAL. File name for the generated report. If none is provided, the file name will be generated using the calibration parameters.

## Report

## Classify
### Usage
**graboid CLASSIFY \-\-work_dir \-\-query_file \-\-dist_mat \-\-w_start \-\-w_end \-\-k \-\-n \-\-cl_mode \-\-rank \-\-out_name \-\-keep_tmp**

### Description
Classify the sequences in the file given as **\-\-query_file**.

### Arguments
**--work_dir** \<path to working directory\>

>Working directory to contain the generated files. The value provided must remain constant for all steps that use the generated database.

**-q**, **\-\-query_file** \<path to query sequence file\>

>File containing the query sequences. Fasta format.

**-dm**, **\-\-dist_mat** \<distance matrix\>

>Distance matrix to be used in the neighbour calculation. String *id* to use the identity matrix. String *s1v2* to use the transition = 1, transversion = 2 matrix. A custom matrix can be used by passing it as a file path.

**-ws**, **\-\-w_start** \<window start\>

>Starting coordinates for the window of the alignment to use in classification.

**-we**, **\-\-w_end** \<window end\>

>End coordinates for the window of the alignment to use in classification.

**\-\-k** \<k1\> \<k2\>...

>K values to use in classification. Multiple values can be provided to assess consistency in classification results.

**\-\-n** \<number of sites\>

>Number of informative sites to use in the classification.

**-md**, **\-\-cl_mode** \<classification mode\>

>Classification criterion to be used. Single character string: *m* to use majority mode classification, *w* to use wKNN classification and *d* to use dwKNN classification.
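
The difference between the three modes lies in how much weight each of the *k* nearest neighbours contributes to the vote. The sketch below uses the commonly cited formulations: equal votes for majority mode, the linear distance weights of wKNN, and those weights multiplied by an extra ratio for dwKNN. Whether graboid uses exactly these formulas is an assumption, and the function names are invented for the example.

```python
from collections import defaultdict


def neighbour_weights(distances, mode):
    """Weights for neighbour distances sorted in ascending order."""
    if mode not in ("m", "w", "d"):
        raise ValueError("mode must be 'm', 'w' or 'd'")
    nearest, furthest = distances[0], distances[-1]
    if mode == "m" or furthest == nearest:
        return [1.0] * len(distances)  # every neighbour counts the same
    span = furthest - nearest
    # wKNN: weight falls linearly from 1 (nearest) to 0 (furthest).
    weights = [(furthest - d) / span for d in distances]
    if mode == "d":
        # dwKNN: additionally penalise neighbours by their absolute distance.
        weights = [
            w * (furthest + nearest) / (furthest + d)
            for w, d in zip(weights, distances)
        ]
    return weights


def classify(neighbours, mode="m"):
    """Pick the taxon with the highest total weight among (distance, taxon) pairs."""
    ordered = sorted(neighbours)
    weights = neighbour_weights([distance for distance, _ in ordered], mode)
    support = defaultdict(float)
    for (_, taxon), weight in zip(ordered, weights):
        support[taxon] += weight
    return max(support, key=support.get)


neighbours = [(1, "Apis"), (4, "Bombus"), (5, "Bombus")]
majority_call = classify(neighbours, "m")
wknn_call = classify(neighbours, "w")
dwknn_call = classify(neighbours, "d")
```

In this example the majority vote favours the two more distant *Bombus* neighbours, while both weighted modes favour the single, much closer *Apis* neighbour.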

**-rk**, **\-\-rank** = 'genus'

>Taxonomic rank used for attribute selection. Default value is *genus*.

**-o**, **\-\-out_file**

>OPTIONAL. File name for the generated results. If none is provided, the filename is generated from datetime and the query file name.

**\-\-keep_tmp**

>Toggles deletion of temporary files. If provided, the generated temporary files are kept; otherwise, the temporary directory is wiped.