# Graboid documentation
## Classification
This module contains the functions used in KNN classification of sequence data.

### Functions
* **calc_distance(seq1, seq2, dist_mat)** Returns the distance between *seq1* and *seq2* using *dist_mat*. <ins>NOTE: moved from *distance*; the copy left there should be deleted</ins>
* **get_dists(query, data, dist_mat)** Returns the distance between the sequences in a *query* matrix and the sequences in the *data* matrix using *dist_mat*. <ins>NOTE: when *query* has a single sequence, it should be reshaped into a 2-dimensional array</ins>
* **wknn(dists)** Returns the weighted support for the given distances (*dists*) using the WKNN equation
* **dwknn(dists)** Returns the weighted support for the given distances (*dists*) using the DWKNN equation
* **softmax(supports)** Calculates the softmax activation given an array of supports. <ins>NOTE: could implement this in the weighted classification</ins>
* **classify(query, data, tax_tab, dist_mat, k = 0, mode = 'mwd', prev_dists = None)** Classifies every sequence in *query* using the sequences in *data* as references. *tax_tab* contains the taxonomic IDs for every reference sequence at each taxonomic rank <ins>NOTE: the passed table should be preprocessed to contain only the appropriate *rows* in the appropriate order, and only the taxID columns</ins>. Distance calculation is performed using the given *dist_mat*. Parameter *k* can be either a single value or a list. By default (*k* = 0), the classification is performed using all neighbours (this tends to render the *majority* mode useless). If multiple values of *k* are given, classification is performed for each of them. If *prev_dists* is given, the distances calculated for the given sites are added to those previous results. This option is used when utilizing multiple values of *n*. Parameter *mode* specifies the classification modes to be used; each character in the string selects a mode in which to call the *classifier* function:
    * *m*: use majority vote
    * *w*: use wKNN mode
    * *d*: use dwKNN mode
    
  Returns a dictionary with keys *m*, *w* and *d*, and the calculated distances for each sequence
* **classifier(neighs, tax_tab, k, mode, distances)** Called by *classify* to assign classification based on calculated distances. Each row in the *neighs* matrix contains the indexes of each neighbour in the reference data ordered by its proximity to a query sequence. *tax_tab* contains the taxonomic data that will be used to classify (only taxIDs). Classification is done using every value provided in *k* (therefore *k* must be an iterable). If *mode* is set to *m*, generate for each row in *neighs* an array containing *row_idx*, *taxonomic rank*, *taxon*, *taxon representatives*, *k*, *mean distance to representatives* and *std distance to representatives*. If *mode* is set to *w* or *d*, the provided distances will be used to calculate the support for each neighbour of each sequence using either the *wknn* or *dwknn* functions. The arrays generated when using *w* or *d* modes contain *row_idx*, *taxonomic rank*, *taxon*, *taxon representatives*, *k*, *mean distance to representatives*, *std distance to representatives* and *total taxon support*. <ins>NOTE: this function combines the use of *classify_majority* and *classify_weighted*, so those two can go. Furthermore, *calibration* can be used in place of *calibration_classify*, so that one can go as well</ins>
* **get_classification(results)** Parses the winning classification from the results generated by *classify*. Parameter *results* is the dictionary of keys *m*, *w* and *d* generated by *classify*. From each array, it extracts the winning row for every query sequence, for every *k* value and for every taxonomic rank. In mode *m*, the winner is determined as the taxon with the most representatives (highest count) amongst the k nearest neighbours. In modes *w* and *d*, winners are determined as those with the highest support. Weighted results are scored with the *softmax* function. Rows corresponding to the winning taxon(s) are extracted for each query/k/rank combination and stored in a dictionary of keys *m*, *w* and *d*
* **parse_report(report)** Receives a dictionary generated by *get_classification* and generates a dataframe with columns *idx*, *rank*, *taxon*, *count*, *k*, *mean distance*, *std distance*, *mode*, *support*, *score*
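
The weighting and scoring functions listed above can be illustrated with a minimal sketch. This is not Graboid's own code: the function names are hypothetical, and the weighting formulas are assumed to be the commonly used distance-weighted KNN rule (weights scaled linearly between the nearest and farthest neighbour) and its dual-weighted variant (the same weight multiplied by a penalty term). The sketch also assumes *dists* is a one-dimensional array sorted in ascending order.

```python
import numpy as np

def wknn_sketch(dists):
    """Assumed WKNN rule: linear weight from 1 (nearest) to 0 (farthest)."""
    dists = np.asarray(dists, dtype=float)
    d_near, d_far = dists[0], dists[-1]
    if d_far == d_near:
        # All neighbours are equally distant, give them equal weight
        return np.ones_like(dists)
    return (d_far - dists) / (d_far - d_near)

def dwknn_sketch(dists):
    """Assumed DWKNN rule: WKNN weight times a distance-based penalty."""
    dists = np.asarray(dists, dtype=float)
    d_near, d_far = dists[0], dists[-1]
    if d_far == d_near:
        return np.ones_like(dists)
    penalty = (d_far + d_near) / (d_far + dists)
    return wknn_sketch(dists) * penalty

def softmax_sketch(supports):
    """Numerically stable softmax over an array of supports."""
    supports = np.asarray(supports, dtype=float)
    shifted = np.exp(supports - supports.max())
    return shifted / shifted.sum()
```

Under these assumptions, the nearest neighbour always receives weight 1 and the farthest receives weight 0, while the dual-weighted variant further reduces the weight of intermediate neighbours.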

**Deprecated functions**
___
* **get_neighs(query, data, dist_mat)** Get *query*'s neighbours in *data* ordered by their proximity to *query*. Also returns the sorted distances, calculated using *dist_mat* <ins>NOTE: Deprecated, this function's role is fulfilled by *classify*</ins>
* **calibration_classify(q, k_range, data, tax_tab, dist_mat, q_name = 0)** Calibration of a single instance using a range of neighbours and all classification methods. Argument *q* is the query instance, *k_range* is the range of neighbours to utilize, *data* is the reference sequence matrix, *tax_tab* is the taxonomy table for the reference data, *dist_mat* is the distance matrix to be used in the classification, *q_name* is the numeric value of the query, used to organize results. Returns three arrays *maj_results*, *wknn_results*, and *dwknn_results* containing the classification results generated for each method
* **classify_majority(neighs, tax_tab, q_name = 0, total_k = 1)** Classify a query instance selecting for each rank the most represented taxon amongst the given *neighs*. Returns an array with columns *q_name*, *rank*, *taxon*, *max value*, *total_k*
* **classify_weighted(neighs, supports, tax_tab, q_name = 0, total_k = 1)** Classify a query instance using the weighted *supports* of the given *neighs*. Taxonomy for the provided neighbours is given in *tax_tab*. Argument *q_name* is used to name the query in the result table. Argument *total_k* indicates the number of neighbours considered. Returns an array with columns *query name*, *rank*, *taxon*, *representative count*, *total_k*, *total tax support*, *mean taxon support*, *std taxon supports*, detailing the support for each rank in each taxon
* **get_classif(results, mode = 'majority')** Gets a classification from the given result table. Argument *mode* specifies the classification method used to generate the result, values are *majority* and *weighted*
* **get_classif_majority(results, n_ranks = 6)** Get the classification from a majority vote result table. Assign the most represented taxon for each rank, if there is a draw, leave the classification ambiguous. Returns an array with the assigned taxon for each rank
* **get_classif_weighted(results, n_ranks = 6)** Get the classification from a weighted vote result table. Assign the most supported taxon for each rank, if there is a draw, leave the classification ambiguous. Returns an array with the assigned taxon for each rank <ins>NOTE: could add the softmax support for the assigned classification</ins>

### Cost matrix
This module contains the functions used to generate the cost matrices used in distance calculations.
#### Functions
* **pair_idxs(bases0, bases1)** Called by *cost_matrix*, used to calculate distances between ambiguous bases
* **cost_matrix(transition = 1, transversion = 2)** Generates a distance matrix based on the K2P substitution model. Arguments *transition* and *transversion* determine how these substitutions are penalized
* **id_matrix()** Generates an identity-style cost matrix in which diagonal values are 0 (except cell 0,0) and all other values are 1
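
To illustrate how the *transition* and *transversion* penalties shape a cost matrix, here is a simplified sketch. It is an assumption-laden illustration, not the module's implementation: it covers only the four unambiguous bases in the order A, C, G, T, ignores gaps and ambiguous bases (which *pair_idxs* handles in the real module), and the name `k2p_cost_sketch` is hypothetical.

```python
import numpy as np

BASES = "ACGT"          # assumed base order, for illustration only
PURINES = {"A", "G"}    # C and T are pyrimidines

def k2p_cost_sketch(transition=1, transversion=2):
    """Build a 4x4 substitution cost matrix for unambiguous bases."""
    n = len(BASES)
    costs = np.zeros((n, n))
    for i, base0 in enumerate(BASES):
        for j, base1 in enumerate(BASES):
            if base0 == base1:
                continue  # identical bases cost nothing
            # A substitution within the same chemical class is a transition
            same_class = (base0 in PURINES) == (base1 in PURINES)
            costs[i, j] = transition if same_class else transversion
    return costs
```

With the default arguments, an A to G change costs 1 and an A to C change costs 2.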

### Director
This module directs the classification of query sequences of unknown taxonomy
#### Functions
* **get_taxonomy(taxa, taxguide)** Given a *taxa* list, reconstruct the taxonomy for each taxon, retrieving parent taxon IDs from the given *taxguide*. Return a dictionary of elements taxID:\[\<taxonomy IDs (ascending)>\]. Also returns a list of all given taxa not found in the taxguide table
* **get_best_params(subreport, metric = 'F1_score')** Takes a subsection of a calibration report. Return average score of *metric* for each combination of parameters (*w_start*, *K* and *n_sites*) represented in *subreport*
* **get_valid_windows(windows, overlap, crop = True)** Filter out windows not included in the space in which the query and reference matrices overlap. If *crop* is True, windows that overlap partially are cropped, otherwise they are discarded
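
The window filtering described for *get_valid_windows* could look roughly like the sketch below. The data layout is an assumption: *windows* is taken to be a sequence of (start, end) pairs and *overlap* a single (start, end) pair, and the name `valid_windows_sketch` is hypothetical.

```python
import numpy as np

def valid_windows_sketch(windows, overlap, crop=True):
    """Keep windows inside the overlap region, cropping or discarding partial ones."""
    windows = np.asarray(windows).reshape(-1, 2)
    ov_start, ov_end = overlap
    # Windows that share at least some positions with the overlap region
    touches = (windows[:, 0] < ov_end) & (windows[:, 1] > ov_start)
    kept = windows[touches]
    if crop:
        # Trim partially overlapping windows to the overlap boundaries
        return np.clip(kept, ov_start, ov_end)
    # Otherwise keep only windows fully contained in the overlap region
    inside = (kept[:, 0] >= ov_start) & (kept[:, 1] <= ov_end)
    return kept[inside]
```

Windows that fall entirely outside the overlap region are dropped in both cases; *crop* only changes what happens to the partial ones.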

#### Director
**class Director(out_dir, tmp_dir)**
This class directs the mapping and classification of an input sequence file.

##### Attributes
* **out_dir** Directory to which the results will be saved
* **tmp_dir** Directory to which the temporary files will be saved
* **mat_file** Path to the reference matrix file
* **acc_file** Path to the reference accession file
* **tax_file** Path to the reference taxonomy file
* **report** Calibration report file
* **w_len** Window length used in the calibration (inferred by the *set_report* method)
* **w_step** Window step used in the calibration (inferred by the *set_report* method)
* **taxa** List of taxa to prioritize in the classification
* **fasta_file** Path to the last query file fed to the *map_query* method
* **query_blast** Path to the last blast report generated by the *map_query* method
* **query_map** Path to the last alignment matrix generated by the *map_query* method
* **query_accs** Path to the last accession list generated by the *map_query* method
* **windows**
* **params**
* **result**
* **mapper** Mapper instance used to generate an alignment matrix for the query sequences
* **loader** *WindowLoader* instance used to select window fragments from the reference matrix
* **selector** *Selector0* instance used to select the informative sites required for classification

##### Methods
* **set_reference(mat_file, acc_file, tax_file)**
* **set_db(db_dir)** Set the given blast database directory on the *mapper* instance. Should be the same database used to generate the reference alignment
* **set_report(report_file)**
* **set_taxa(taxa)** Set a given set of taxa to prioritize in the classification
* **map_query(fasta_file, db_dir, threads = 1)** Build an alignment matrix for the query sequence file. Uses *mapping* module
* **get_windows(metric = 'F1_score', min_overlap = 0.9)** Selects the best window to be used for each query sequence based on the calibration report using the chosen *metric*. For a window to be considered, it must overlap a given match by at least a fraction of *min_overlap*. Updates the *windows* attribute with a dataframe with columns *w_start*, *w_end*, *K*, and *n_sites* for each query sequence.
* **hint_params(w_start, w_end, metric = 'F1_score')** Get the best parameter sets for the windows defined by coordinates *w_start* and *w_end*, determined by the highest score for the given *metric*. If a set of *taxa* has been defined, establish the parameters that yield the best results for each taxon in each window (if the taxon is present).
* **get_overlap()** Gets the coordinates of the overlapping section between the query and reference alignments (if there is any). If there is an overlap, returns True and the overlap coordinates, otherwise returns False and None
* **classify_manual(w_start, w_end, k, n, mode = 'mwd', crop = True)** Performs a classification using the given parameters. If a single value for *w_start* is given, it and *w_end* are taken as the coordinates of the single window to be used. Otherwise, each *w_start* value is taken as a window starting position and *w_end* is taken as the window length. Parameters *k* and *n* can be either single values or ranges. If multiple values are provided, each combination is tested. <ins>NOTE: if multiple *n* values are given, call the *get_sites* method of the *Selector* instance. This generates a dictionary with the sites that are incorporated with each incremental value of n. For each iteration, the distance is calculated using only the new additions, and the results of the previous iteration are added afterwards. This cumulative method should reduce computation time via the elimination of redundant calculations</ins>. For each value of *n*, classification is attempted with every given *k* value and every given *mode*. The *mode* argument represents the classification modes to be used (*m*: majority, *w*: wKNN, *d*: dwKNN). Argument *crop* determines how partially overlapping windows are handled in the call to *get_valid_windows*
* **classify_with_params()**
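
The cumulative distance calculation described in the note for *classify_manual* can be sketched as follows. This is a hedged illustration rather than the actual implementation: it assumes sequences are integer-encoded matrices whose values index directly into *dist_mat*, that the site dictionary maps each *n* to the sites newly added at that step (in increasing order of *n*), and the names `site_distances` and `cumulative_distances` are hypothetical.

```python
import numpy as np

def site_distances(query, data, dist_mat):
    """Sum per-site costs between every query row and every reference row."""
    # Broadcasting builds a (queries, references, sites) cost array
    return dist_mat[query[:, None, :], data[None, :, :]].sum(axis=2)

def cumulative_distances(query, data, dist_mat, sites_by_n):
    """Reuse earlier distances, adding only the cost of newly selected sites."""
    total = np.zeros((query.shape[0], data.shape[0]))
    results = {}
    for n, new_sites in sites_by_n.items():
        new_part = site_distances(query[:, new_sites], data[:, new_sites], dist_mat)
        total = total + new_part  # previous result plus the new additions
        results[n] = total
    return results
```

Because the per-site costs are additive, the distances stored for the largest *n* match a single calculation over all selected sites, while each site is only evaluated once. A single query sequence would need to be reshaped into a 2-dimensional array first, for example with `query.reshape(1, -1)`.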