# Graboid documentation
## Preprocessing

### Windows
This module is used to select segments of the alignment matrix, filter by thresholds of empty rows/columns and extract effective (unique) sequences, as well as their corresponding taxonomies.

#### Functions
* **filter_matrix(matrix, thresh = 1, axis = 0)** Filters columns (*axis* = 0) or rows (*axis* = 1) in the given *matrix*, by a given *thresh* of empty values. Returns the indexes of the filtered cells.
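
The filtering idea can be sketched as follows. This is a minimal illustration, not the module's code: the name `filter_by_empty`, the `EMPTY = 0` encoding and the reading of *thresh* as the maximum allowed fraction of empty values are all assumptions, and plain lists stand in for the NumPy matrix to keep the example self-contained.

```python
EMPTY = 0  # assumed code for an empty (missing) value


def filter_by_empty(matrix, thresh=1.0, axis=0):
    """Return the indexes of the columns (axis=0) or rows (axis=1)
    whose fraction of empty values does not exceed thresh."""
    # For axis=0, transpose so that each "line" is a column.
    lines = list(zip(*matrix)) if axis == 0 else matrix
    kept = []
    for idx, line in enumerate(lines):
        empty_frac = sum(value == EMPTY for value in line) / len(line)
        if empty_frac <= thresh:
            kept.append(idx)
    return kept


matrix = [
    [1, 0, 3, 0],
    [2, 0, 3, 4],
    [1, 2, 0, 0],
]
kept_cols = filter_by_empty(matrix, thresh=0.5, axis=0)   # columns 0 and 2
kept_rows = filter_by_empty(matrix, thresh=0.25, axis=1)  # row 1 only
```

With the default `thresh=1.0` nothing is discarded, since no row or column can be more than 100% empty.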

###### Collapser functions (OLD)
<ins>All of this will go away once the new functions are tested.</ins>
* **build_cons_tax(subtab)** Generates a consensus row from a given taxonomic *subtab* containing the taxonomies of all the members of a given effective *cluster*. <ins>Called by *collapse_1*</ins>
* **get_ident(seq0, seq1)** Returns True if *seq0* == *seq1*. False otherwise. <ins>Called by *get_effective_seqs_3*, *get_effective_clusters*, *get_ident_matrix*</ins>
* **build_roadmap(matrix)** Builds a map of the positions of each value in each column of the matrix <ins>Called by *collapse_0*</ins>
* **build_nodes(seq, idxs, roadmap)** Recursively builds nodes to locate the effective sequences <ins>Called by *collapse_0*</ins>
* **collapse_0** Returns effective sequences and the indexes of every cluster. <ins>Called by *collapse_1*</ins>
* **get_ident_matrix(eff_seqs)** Builds a matrix with the pairwise identity between the effective sequences <ins>Called by *crop_effectives*</ins>
* **get_shscore(seq0, seq1)** Calculates shared score between identical sequences. Returns True if *seq0* is more complete than *seq1*, False otherwise. <ins>Called by *compare_ident*</ins>
* **compare_ident(ident, matrix)** Gets the shared score for every pair of sequences given in *matrix*. <ins>Called by *crop_effectives*</ins>
* **get_winners(nseqs, pairs, scores)** Returns indexes of sequences with higher shared score (to keep). <ins>Called by *crop_effectives*</ins>
* **crop_effectives(effective_seqs, effective_idxs)** Remove redundant sequences from the effective sequence cluster. <ins>Called by *collapse_1*</ins>
* **collapse_1(matrix, tax_tab)** Direct construction of the collapsed matrix and taxonomy table

###### Collapser functions (NEW)
* **build_effective_matrix(eff_idxs, matrix)** Uses the list of indexes for each cluster (*eff_idxs*) to create a consensus sequence for each one from the data contained in *matrix*. Returns *effective_matrix*
* **build_effective_taxonomy(eff_idxs, tax_tab)** Uses the list of indexes for each cluster (*eff_idxs*) to create a consensus taxonomy table from *tax_tab*. When a conflict is found within a cluster, the taxon assigned to the current rank and all ranks further below is the last unconflicting taxon.
* **collapse_window(matrix, tax_tab)** Identical to the *collapse_window* method of the class *Window*, extracted to be used outside the class.

<ins>NOTE: these aren't tested yet. Should be included in *Window.process_window*</ins>
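
The conflict rule described for *build_effective_taxonomy* can be illustrated with a small sketch. The function name `consensus_taxonomy` and the list-of-lists layout (one row per sequence, ranks ordered from highest to lowest) are assumptions made for this example; the real function works on *tax_tab*.

```python
def consensus_taxonomy(rows):
    """Build one consensus row from the taxonomy rows of a cluster.

    rows: equal-length lists, ordered from the highest to the lowest rank.
    """
    consensus = []
    last_agreed = None  # stays None if the cluster already disagrees at the first rank
    conflict = False
    for rank_values in zip(*rows):
        if not conflict and len(set(rank_values)) == 1:
            last_agreed = rank_values[0]
        else:
            # From the first conflict on, keep repeating the last agreed taxon.
            conflict = True
        consensus.append(last_agreed)
    return consensus


cluster_tax = [
    ["Chordata", "Mammalia", "Carnivora", "Felidae"],
    ["Chordata", "Mammalia", "Carnivora", "Canidae"],
    ["Chordata", "Mammalia", "Primates", "Hominidae"],
]
cons = consensus_taxonomy(cluster_tax)
```

Here the members agree down to the class, so the order and family of the consensus row are both filled with `"Mammalia"`.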

#### Tree
**class Tree()**

This class is used to collapse the effective sequences in a given window. Called by the *process_window* method of the *Window* class.
##### Attributes
* **leaves** List containing the indexes of the found effective sequences. Generated by the *build* method.

##### Methods
* **build(matrix)** Takes the given window (*matrix*) and constructs a tree composed of *Node* instances to find the effective sequences. Only the indexes contained in the leaf nodes are kept.

#### Node
**class Node(lvl, value, row, indexes, matrix, tree)**

Class used to collapse the effective sequences present in the given *matrix*. Recursively creates child nodes upon initialization, stopping when the end of the matrix is reached. NOTE: the passed matrix should be transposed in the first node.

##### Parameters
* **lvl**
* **value**
* **row**
* **indexes**
* **matrix**
* **tree**

##### Attributes
* **lvl** Column number of the node instance
* **row** Rows belonging to the node's branch in the current column
* **indexes** Indexes of the node's branch in the current column
* **matrix** Window to be collapsed
* **children** List containing the node's children. Constructed upon initialization
* **tree** *Tree* instance containing the tree. When the window end is reached, the leaf node's indexes are passed to the tree's *leaves* attribute

##### Methods
* **get_children()** Defines the child nodes as a function of the values present in the current row. Locates the global indexes for the found values and instantiates a child node for each one.
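
The recursive splitting performed by *Tree* and *Node* can be sketched without the classes. This is an illustration under assumptions: `find_effective_clusters` is a hypothetical name, rows are grouped by exact identity, and any special handling of missing values in the real code is not reproduced.

```python
def find_effective_clusters(matrix):
    """Group the row indexes of matrix into clusters of identical rows
    by splitting on one column at a time."""
    if not matrix:
        return []
    n_cols = len(matrix[0])
    leaves = []

    def split(col, indexes):
        # End of the window reached: these rows are identical in every column.
        if col == n_cols:
            leaves.append(indexes)
            return
        # Partition the rows of this branch by their value in the current column.
        children = {}
        for idx in indexes:
            children.setdefault(matrix[idx][col], []).append(idx)
        for child_indexes in children.values():
            split(col + 1, child_indexes)

    split(0, list(range(len(matrix))))
    return leaves


window = [
    [1, 2, 3],
    [1, 2, 4],
    [1, 2, 3],
    [2, 2, 3],
]
clusters = find_effective_clusters(window)
```

Each leaf holds the indexes of one cluster, which is the role the *leaves* attribute plays above. The recursion depth equals the number of columns, so a very wide window could exceed Python's default recursion limit in a sketch like this one.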

#### WindowLoader
**class WindowLoader(logger = logger)**

This class is used to extract and preprocess a given segment of the alignment matrix, along with the corresponding taxonomy table and accession list.

##### Parameters
* **logger** Parent logger to be used. Passed upon initialization because this class is used by multiple other modules.

##### Attributes
* **logger** *logging.Logger* instance
* **mat_file** File containing the alignment matrix
* **acc_file** File containing the accession list for the alignment matrix
* **tax_file** File containing the taxonomy table for the alignment matrix
* **matrix** Numpy array storing the alignment matrix
* **dims** Matrix dimensions
* **acclist** Accession list
* **tax_tab** Taxonomy table

##### Methods
* **set_files(mat_file, acc_file, tax_file)** Loads the matrix, accession and taxonomy files for a given alignment
* **get_window(start, end, row_thresh = 0.2, col_thresh=0.2)** Selects the window delimited by the columns *start* and *end* and returns a *Window* instance filtered by *row_thresh* and *col_thresh*

#### Window
**class Window(matrix, start, end, row_thresh = 0.2, col_thresh = 0.2, loader = None)**

This class contains the selected window and can be used to filter out incomplete rows/columns and collapse effective sequences.

##### Parameters
* **matrix**
* **start**
* **end**
* **row_thresh**
* **col_thresh**
* **loader**

##### Attributes
* **matrix** Segment of the alignment matrix passed as a window
* **start** Starting position of the window
* **end** Ending position of the window
* **loader** *WindowLoader* instance that generates the window. Used to retrieve the taxonomy table
* **shape** Tuple containing the dimensions of the window
* **row_thresh** Maximum proportion of empty rows allowed per column
* **col_thresh** Maximum proportion of empty columns allowed per row
* **rows** Indexes of the rows of *matrix* selected to compose *window*
* **cols** Indexes of the columns of *matrix* selected to compose *window*
* **window** Matrix generated after filtering *matrix* by the given thresholds
* **tax_tab** Taxonomy table of the filtered *window*. Retrieved using *self.rows*
* **eff_idxs** Indexes of the collapsed effective sequences in *window*
* **eff_mat** Effective matrix
* **eff_tax** Consensus taxonomy built for the effective matrix
* **cons_mat** DEPRECATED
* **cons_tax** DEPRECATED


##### Methods
* **process_window(row_thresh, col_thresh)** Applies the given *row_thresh* and *col_thresh* to filter *matrix* and generate *window*. Collapses the effective sequences and retrieves the consensus taxonomy by calling *collapse_window*
* **collapse_window()** Creates a *Tree* instance to collapse the *window* and retrieve the indexes for every cluster of effective sequences. Extracted indexes are stored in the attribute *eff_idxs*. Builds *eff_mat* and *eff_tax*, storing the effective matrix and consensus taxonomy

### Feature Selection
This module handles entropy calculation for a given window of the alignment.

#### Functions
* **get_entropy(array)** Calculates the Shannon entropy for a given column (*array*)
* **get_matrix_entropy(matrix)** Calculates the entropy for the given *matrix*. Adjusts the entropy using Jorge's equation, (2 - entropy) / 2, so that 1 = minimum entropy and 0 = maximum entropy
* **pte(matrix, tax_tab)** <ins>Will replace *per_tax_entropy*</ins>. Returns a multiindexed dataframe (rank, taxon) with the entropy for each base
* **per_tax_entropy(matrix, tax_tab)** Calculates entropy per taxon for each rank in the given *tax_tab*. Returns data frame *ent_tab* containing the per-site entropy of each taxon in every rank, plus a rank column
* **get_ent_diff(matrix, tax_tab)** Returns an entropy difference dataframe for each taxon with multiindex (rank, taxon)
* **get_gain(matrix, tax_tab)** Calculates information gain for each site/taxon/rank
* **plot_gain(table, rank, criterium)** Create a barplot for the entropy difference at the given rank
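
The entropy calculation and its rescaling can be sketched as below. The function names are hypothetical, and two assumptions are made: the logarithm is base 2, and each column holds at most four distinct symbols, so the maximum entropy is 2 bits and (2 - entropy) / 2 falls between 0 and 1. The sketch returns one value per column, and how the module treats gaps or missing values is not covered.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the values in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def adjusted_matrix_entropy(matrix):
    """Per-column entropy rescaled as (2 - entropy) / 2."""
    return [(2 - shannon_entropy(col)) / 2 for col in zip(*matrix)]


matrix = [
    ["A", "A", "A"],
    ["A", "C", "C"],
    ["A", "G", "A"],
    ["A", "T", "C"],
]
adjusted = adjusted_matrix_entropy(matrix)
```

The first column is fully conserved (adjusted value 1), the second holds four different bases (adjusted value 0) and the third is split evenly between two bases (adjusted value 0.5).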

#### Selector
**class Selector(matrix, tax)**

This class takes a *matrix* and calculates the amount of information contained in each of its columns.
##### Parameters
* **matrix** Window of the alignment matrix
* **tax** Taxonomy table corresponding to the given *matrix*

##### Attributes
* **matrix**
* **tax**
* **ranks** List of ranks represented in the taxonomy table
* **diff_tab** Entropy difference table. Dataframe with multiindex (rank, taxon)
* **order_tab** Table containing the (ascending) ordered position of the indexes in each row of *diff_tab*
* **selected_tax** Dictionary of the selected taxons for each *rank*
* **selected_seqs** Dictionary of indexes of the selected sequences for each *rank*
* **selected_rank** Rank used to build the last selected matrix
* **selected_sites** Array of indexes of the selected sites for a given *rank*

##### Methods
* **build_diff_tab()** Generate the entropy difference table for *matrix*. If a filter has been applied, use filtered matrix. Store result in *diff_tab*. Generate *order_tab* containing the ordered indexes of *diff_tab*
* **select_sites(nsites, rank)** Select the *nsites* most informative sites for the given *rank*. Store results in *selected_sites*. Return a cropped matrix and the corresponding rows of the taxonomy table. <ins>NOTE: this method should run AFTER *build_diff_tab*</ins>
* **select_taxons(ntaxes, minseqs, thresh)** Apply thresholds to the taxons represented in *matrix*. If *ntaxes* is given, select the *ntaxes* most populated taxons for each rank. If *minseqs* is given, select only taxons with more than *minseqs* sequences. If *thresh* is given, select only taxons that represent a percentage of the total greater than *thresh*. Results are stored in the attributes *selected_tax*, containing the selected taxons per *rank*, and *selected_seqs*, containing the indexes of the sequences belonging to a selected taxon per *rank*
* **get_training_data(rank)** Returns the submatrix containing the selected sequences and columns for the given *rank*. Also returns the corresponding taxonomy table.
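
The selection steps can be sketched as below. All names are hypothetical, the per-site scores are made up, and it is an assumption that a higher score means a more informative site. Only the *minseqs* criterion of *select_taxons* is shown.

```python
from collections import Counter


def select_top_sites(scores, nsites):
    """Return the indexes of the nsites highest-scoring sites, in column order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:nsites])


def select_taxa(taxa, minseqs):
    """Return the taxa with more than minseqs sequences and the indexes
    of the rows that belong to them."""
    counts = Counter(taxa)
    selected = {taxon for taxon, n in counts.items() if n > minseqs}
    rows = [i for i, taxon in enumerate(taxa) if taxon in selected]
    return selected, rows


matrix = [
    [1, 2, 3, 4],
    [1, 3, 3, 2],
    [2, 2, 1, 4],
    [1, 2, 3, 1],
]
taxa = ["Felidae", "Felidae", "Canidae", "Felidae"]  # one taxon per row, single rank
scores = [0.1, 0.9, 0.4, 0.7]                        # made-up per-site scores

sites = select_top_sites(scores, nsites=2)
selected, rows = select_taxa(taxa, minseqs=1)

# Submatrix of the selected rows and columns, plus the matching taxonomy.
training = [[matrix[r][c] for c in sites] for r in rows]
training_tax = [taxa[r] for r in rows]
```

The last two lines mirror what *get_training_data* is described as returning: the selected sequences restricted to the selected sites, together with their taxonomy.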

### Taxon study
Study the internal variation of a taxonomic cluster of sequences

#### Functions
* **build_tax_clusters(matrix, tax_tab, rank, dist_matrix = cost)** Builds a dictionary of TaxId:*TaxCluster* for each taxon present in *tax_tab*.
* **get_paired_dists(matrix, dist_matrix = cost)** Calculates the paired distances between the sequences contained in *matrix* using a distance matrix. Distance data is located above the diagonal.
* **get_flat_dists(matrix)** Returns the flattened top right half of the distance matrix. Used by *TaxCluster* to get the average distance within the cluster.
* **get_rowcol(mat, idx)** Takes a paired distance matrix (*mat*) and an index (*idx*). Returns the distance values for item number *idx* of the matrix.
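
The upper-triangle layout used by these functions can be sketched as follows. The `COST` table (0 for a match, 1 for a mismatch, indexed by integer base codes 0 to 3) and the function names are assumptions for illustration; the module's own *cost* matrix may differ.

```python
# Hypothetical substitution costs between four base codes.
COST = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]


def paired_dists(matrix, dist_matrix=COST):
    """Pairwise distances between rows, stored above the diagonal only."""
    n = len(matrix)
    dists = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dists[i][j] = sum(
                dist_matrix[a][b] for a, b in zip(matrix[i], matrix[j])
            )
    return dists


def flat_dists(dists):
    """Flatten the top right half of the distance matrix."""
    n = len(dists)
    return [dists[i][j] for i in range(n) for j in range(i + 1, n)]


def rowcol(dists, idx):
    """Distances from item idx to every other item, read from the upper triangle."""
    n = len(dists)
    return [dists[min(idx, j)][max(idx, j)] for j in range(n) if j != idx]


seqs = [
    [0, 1, 2, 3],
    [0, 1, 2, 0],
    [3, 1, 0, 0],
]
dists = paired_dists(seqs)
flat = flat_dists(dists)
from_item_1 = rowcol(dists, 1)
```

Because only the upper triangle is filled, the distances for one item come partly from its row and partly from its column, which is why `rowcol` reads both.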

#### clust_iterator
#### SuperCluster
**class SuperCluster(matrix, tax_tab, rank, dist_matrix = cost)**

This class contains all the taxon clusters in the selected data. Acts as an iterable.
##### Parameters
* **matrix** Sequence matrix
* **tax_tab** Taxonomy table for the sequence matrix
* **rank** Taxonomic rank to cluster by
* **dist_matrix** Distance matrix to be used in distance calculations

##### Attributes
* **clusters** Dictionary of the form *TaxID*:*TaxCluster* containing a *TaxCluster* instance for each taxon in the specified *rank*
* **tax_list** List of taxon IDs
* **centroids** Dictionary of the form *TaxID*:Array containing the centroid sequence for each taxon
* **centroid_dists** Paired distance matrix between the *centroids*

#### TaxCluster
**class TaxCluster(matrix, taxid, rank, dist_matrix = cost)**

This class holds the sequences of a unique taxon. Used to generate relevant data (paired distances, dispersion, centroids) and collapse unique sequences at the taxon level
##### Parameters
* **matrix** Matrix containing the sequences of the cluster
* **taxid** Taxon Id
* **rank** Taxonomic rank of the cluster
* **dist_matrix** Distance matrix to use in distance calculations

##### Attributes
* **matrix**
* **paired** Paired distances between the sequences in *matrix*
* **taxid** Taxon ID
* **rank** Taxonomic rank
* **nseqs** Number of sequences in *matrix*
* **mean** Mean paired distance within the cluster
* **std** Standard deviation of distances within the cluster
* **max_range** Maximum distance between two members of the cluster
* **means** Average distance of each sequence to the rest of the cluster
* **centroid** Sequence with the lowest mean distance to the rest of the cluster

##### Methods
* **get_mean_dists()** Calculate the mean distance of every member of the cluster. Determine the centroid sequence. Update *means* and *centroid*
* **get_params()** Calculate mean paired distance within the cluster (*mean* and *std*), maximum paired distance (*max_range*) and determine the centroid (*means* and *centroid*)
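
The cluster statistics can be sketched from an upper-triangular distance matrix. The function name `cluster_params` and the dictionary it returns are hypothetical, and the use of the population standard deviation is an assumption. The sketch also returns the index of the centroid rather than the sequence itself, and it needs at least two sequences in the cluster.

```python
from statistics import mean, pstdev


def cluster_params(dists):
    """Summarise a cluster from its upper-triangular paired distance matrix."""
    n = len(dists)

    def dist(i, j):
        # Only the upper triangle holds data, so order the indexes first.
        return dists[min(i, j)][max(i, j)]

    pairs = [dists[i][j] for i in range(n) for j in range(i + 1, n)]
    # Mean distance of each member to the rest of the cluster.
    means = [mean([dist(i, j) for j in range(n) if j != i]) for i in range(n)]
    # The centroid is the member with the lowest mean distance.
    centroid = min(range(n), key=means.__getitem__)
    return {
        "mean": mean(pairs),
        "std": pstdev(pairs),
        "max_range": max(pairs),
        "means": means,
        "centroid": centroid,
    }


dists = [
    [0, 1, 3],
    [0, 0, 2],
    [0, 0, 0],
]
params = cluster_params(dists)
```

In this example the second sequence (index 1) is the centroid, with a mean distance of 1.5 to the other two members.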