# Graboid documentation

## Database

### Survey
The Survey module explores sequence databases in search of records that match a taxon/marker pair specified by the user. The search is carried out automatically by the Surveyor class, which performs the cross-search in the database and downloads the list of accession codes to a temporary file. If the connection is interrupted during the download, the program can retry the search a preset number of times (the default number of attempts is 3).
Currently, the valid databases are NCBI and BOLD.
Because downloading data from BOLD is slow regardless of the content downloaded, it is convenient to combine the search step with the download of sequences and taxonomies for this database.

#### Surveyor
**class Surveyor(out_dir)**

This class handles the automatic survey for the given taxon and marker. Survey results are stored in *out_dir* using a file name of the form *\<taxon>_\<marker>_\<database>.summ*
##### Parameters
* **taxon** Taxon to search for. Depending on the experiment, it is recommended to use a high taxonomic rank, such as phylum or class
* **marker** Marker sequence to search for
* **out_dir** Output directory for the generated summary files

##### Attributes
* **tooldict** Class attribute used to identify the survey tool to be used. Key:Value pairs are of the form *\<database>*:*\<SurveyTool>*
* **out_dir**
* **out_files = {}** Dictionary of generated summary files. Key:Value pairs are of the form *\<database>*:*\<summary file>*

##### Methods
* **survey(taxon, marker, database, max_attempts = 3)** Searches *database* for the given *taxon* and *marker* records. Makes up to *max_attempts* attempts. Updates the out_files dictionary upon success.

#### SurveyTool
**class SurveyTool(taxon, marker, out_dir)**
Abstract class that acts as a template for the survey tools specific for each database
##### Parameters
* **taxon** Taxon to search for. Depending on the experiment, it is recommended to use a high taxonomic rank, such as phylum or class
* **marker** Marker sequence to search for
* **out_dir** Output directory for the generated summary files

##### Attributes
* **taxon**
* **marker**
* **database** Database the tool connects to
* **out_file** Path onto which the survey results will be saved. File name of the form *\<taxon>_\<marker>_\<database>.summ*
* **attempt = 1** Attempt counter. Resets every time the *survey* method is called
* **max_attempts = 3** Attempt limit. Can be reset by the *survey* method
* **done = False** Signals successful survey

##### Methods
* **generate_outfile(out_dir)** Called by the constructor to generate the *out_file* attribute
* **survey(max_attempts = 3)** Performs the survey. The argument *max_attempts* overwrites the instance attribute
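
The attempt counter and output naming described above can be sketched as follows. This is a minimal illustration, not Graboid's actual code: the class name `SketchSurveyTool`, the `database` constructor argument and the `downloader` callable are assumptions made for the example, and a failed connection is assumed to raise `OSError`.

```python
import os


class SketchSurveyTool:
    """Hypothetical stand-in for a SurveyTool subclass (illustration only)."""

    def __init__(self, taxon, marker, out_dir, database, downloader):
        self.taxon = taxon
        self.marker = marker
        self.database = database
        # Assumed callable: writes the survey results to the given path and
        # raises OSError when the connection fails
        self.downloader = downloader
        self.out_file = self.generate_outfile(out_dir)
        self.attempt = 1
        self.max_attempts = 3
        self.done = False

    def generate_outfile(self, out_dir):
        # File name of the form <taxon>_<marker>_<database>.summ
        name = f"{self.taxon}_{self.marker}_{self.database}.summ"
        return os.path.join(out_dir, name)

    def survey(self, max_attempts=3):
        # The counter resets and the limit is overwritten on every call
        self.attempt = 1
        self.max_attempts = max_attempts
        self.done = False
        while self.attempt <= self.max_attempts and not self.done:
            try:
                self.downloader(self.out_file)
                self.done = True
            except OSError:
                self.attempt += 1
        return self.done
```

Passing the download step in as a callable keeps the retry logic independent of the database, which mirrors the split between the abstract template and the database-specific tools.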

#### SurveyWAPI(SurveyTool)
Abstract class used as a template for survey tools that connect to databases with a defined search API. Inherits *SurveyTool* methods and attributes

##### Methods
* **attempt_dl()** Called by the *survey* method. Attempts to connect to the database and download the records using the site's API up to *max_attempts* times. If successful, results are stored in *out_file* and the *done* attribute is set to True

#### SurveyBOLD(SurveyWAPI)
Survey tool specific to the BOLD database. Downloaded records take the form of a CSV file containing accessions from different repositories, taxonomic assignments and IDs (specific to BOLD) and the sequence data (TODO). Inherits from *SurveyWAPI*
##### Methods
* **get_dbase()** Returns 'BOLD', used by *generate_outfile* method
* **get_url()** Generates the API url to be used by *attempt_dl*

#### SurveyENA(SurveyWAPI)
Survey tool specific for the ENA database. Given that ENA completely intersects with NCBI, this tool shouldn't be used.
##### Methods
* **get_dbase()** Returns 'ENA', used by *generate_outfile* method
* **get_url()** Generates the API url to be used by *attempt_dl*

#### SurveyNCBI(SurveyTool)
Survey tool specific for the NCBI database. The search is done using Biopython's Entrez module. Retrieves accession codes from consistent records.
##### Methods
* **get_dbase()** Returns 'NCBI', used by *generate_outfile* method
* **attempt_dl()** Equivalent to the method defined in *SurveyWAPI*, connects to the NCBI database using the Entrez module instead of a generated API url.

### List
The *lister* module generates a consensus between the summary files retrieved from multiple databases. Whenever a record is present in multiple summaries, the program picks a single record, prioritizing the NCBI database whenever possible, and discards the rest. The resulting consensus is stored as a CSV file containing a table with the columns **Accession**, **Version** and **Database**, with a name of the form *\<taxon>_\<marker>.acc*

#### Functions
* **detect_summ(file_list)** Given a list of summary files, *file_list*, identify the database to which each belongs. Works under the assumption that there is a single file per database. Current valid databases are BOLD and NCBI
* **read_BOLD_summ(summ_file)**
* **read_NCBI_summ(summ_file)**
* **read_summ(summ_file, database)** Reads the given *summ_file* using the correct function for its corresponding *database*
* **get_shortaccs_ver(acc_list)** Splits the accession codes in *acc_list* from their version numbers. If a given code has no version, it is assigned a value of 1. Returns a list of cropped accessions and one of version numbers
* **build_acc_subtab(acc_list, database)** Uses *acc_list* to build a table with index = short accession (no version number) and columns = accession, version number and *database*
* **clear_repeats(merged_tab)** Locates and clears repeated records in a merged accession table, prioritizing NCBI records when possible
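
The version splitting and the NCBI-first deduplication described above could look roughly like this. The sketch uses plain lists and dictionaries rather than the tables the module works with, assumes the version follows a `.` separator, and the function names are hypothetical.

```python
def split_accession_versions(acc_list):
    """Split 'ACCESSION.VERSION' codes; codes without a version get version 1."""
    short_accs, versions = [], []
    for acc in acc_list:
        short, sep, version = acc.partition(".")
        short_accs.append(short)
        versions.append(int(version) if sep and version.isdigit() else 1)
    return short_accs, versions


def clear_repeats_sketch(records):
    """Keep one record per short accession, preferring NCBI entries.

    records is an iterable of (accession, database) tuples.
    """
    kept = {}
    for acc, database in records:
        short = acc.partition(".")[0]
        # Replace an existing entry only when NCBI can take over from another database
        if short not in kept or (database == "NCBI" and kept[short][1] != "NCBI"):
            kept[short] = (acc, database)
    return kept
```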

#### Lister
**class Lister(out_dir)**

This class compares multiple summary files generated by a *Surveyor* instance and generates a consensus table. For compatibility with the *director* module, class instances get summary files after construction with the *get_summ_files* method.
##### Parameters
* **out_dir** Output directory for the generated accession table

##### Attributes
* **out_dir**
* **out_file** Name of the generated accession table, of the form *\<taxon>_\<marker>.acc*
* **summ_files = {}** Dictionary containing the path to the summary for each database. Key:Value pairs are of the form *\<database>*:*\<summary_file>*

##### Methods
* **get_summ_files(summ_files)** Updates the *summ_files* attribute. DEPRECATED
* **generate_outfile()**
* **build_list(summ_files)** Generates the consensus table from the passed summary files (*summ_files*)

### Fetcher
This module handles acquiring sequences and taxonomy data from the NCBI database and preparing the BOLD summaries for integration. Temporary sequence and taxonomy files are generated following the naming convention *\<taxon>_\<marker>_\<database>.seqtmp* and *\<taxon>_\<marker>_\<database>.taxtmp*. Records that fail to download are registered in an auxiliary table (named *\<taxon>_\<marker>_failed.acc*) for examination or further attempts.

#### Functions
* **set_entrez(email, apikey)** The Entrez API requires a valid e-mail and API key (generated at the [NCBI website](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/))
* **acc_slicer(acc_list, chunksize)** Splits an accession list into chunks of *chunksize* elements, for internal use in the module
* **fetch_seqs(acc_list, database, out_header, chunk_size = 500, max_attempts = 3)** This function is called by the *fetch* method of the *Fetcher* class to do a single pass over a list of accessions. The list is split into chunks of *chunk_size* elements and the program attempts to download each one up to *max_attempts* times. The function returns a list of the accessions that failed to download after *max_attempts* attempts
* **get_accs_from_fasta(fasta_file)** Used to retrieve accession codes when a pre-built fasta file is provided
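
The chunked download with a per-chunk attempt limit can be sketched as below. The names `acc_slicer_sketch` and `fetch_in_chunks` are hypothetical, and `download` is an assumed callable that takes a list of accessions and raises `OSError` on failure; the real module talks to the databases instead.

```python
def acc_slicer_sketch(acc_list, chunksize):
    """Yield consecutive chunks of at most chunksize accessions."""
    for start in range(0, len(acc_list), chunksize):
        yield acc_list[start:start + chunksize]


def fetch_in_chunks(acc_list, download, chunk_size=500, max_attempts=3):
    """Try each chunk up to max_attempts times; return the accessions that failed."""
    failed = []
    for chunk in acc_slicer_sketch(acc_list, chunk_size):
        for _ in range(max_attempts):
            try:
                download(chunk)
                break
            except OSError:
                continue
        else:
            # Every attempt failed, so the whole chunk is reported back
            failed.extend(chunk)
    return failed
```

Returning the failed accessions rather than raising lets the caller decide whether to run a second pass over them.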

#### Fetcher
**class Fetcher(out_dir)**

This class processes the summary table file generated by a *Lister* instance and attempts to retrieve the sequence and taxonomic data for each entry from the corresponding database. Failed accessions are recorded in a table with the same format as the summary table, allowing the user to attempt to download them at a later time.
##### Parameters
* **out_dir** Target directory for the generated files

##### Attributes
* **acc_file** Summary file generated by a *Lister* instance, loaded using the *load_accfile* method.
* **out_header** Prefix for the generated files, of the form *\<out_dir>/\<taxon>_\<marker>*
* **failed_file** File name for the failed accessions file, of the form *\<taxon>_\<marker>_failed.acc*
* **acc_tab** Dataframe containing Accession, Version and Database of each record
* **out_dir**
* **seq_files** Dictionary containing the temporary sequence files generated for each database
* **tax_files** Dictionary containing the temporary taxID files generated for each database
* **out_taxs** File name for the taxIDs file generated by the *fetch_tax_from_fasta* method. Named *\<fasta_file>.taxtmp*

##### Methods
* **load_accfile(acc_file)** Opens the given *acc_file* and saves its contents into the *acc_tab* attribute. Also generates *out_header* and *failed_file*
* **check_accs()** Verifies that *acc_tab* is not empty. If it is, generates a warning and returns False
* **fetch(chunk_size = 500, max_attempts = 3)** Fetches sequences and taxonomies for each record in *acc_tab*. Splits *acc_tab* into chunks of *chunk_size* elements and attempts to download each chunk up to *max_attempts* times. Records that fail to download after *max_attempts* attempts go through a second pass, which attempts to incorporate them into the tmp files. Records that still fail after the second pass are stored in *failed_file*
* **fetch_tax_from_fasta(fasta_file)** Retrieves TaxIDs for the records in a provided fasta file and stores them to a *taxtmp* file analogous to the one generated by the *NCBIFetcher* class *fetch* method.

#### NCBIFetcher
**class NCBIFetcher(out_header)**

This class is in charge of retrieving data from the NCBI database. For each record it is passed, it attempts to download a sequence and a taxonomic ID. Results are dumped into seqtmp and taxtmp files respectively.
##### Parameters
* **out_header** Used to generate the names of the temporary files

##### Attributes
* **out_seqs** File name for the generated sequences file. Of the form *out_header.seqtmp*
* **out_taxs** File name for the generated taxIDs file. Of the form *out_header.taxtmp*

##### Methods
* **fetch(acc_list, chunk_n = 0, max_attempts = 3)** Attempts to download the sequence and taxID of the sequences in *acc_list*. If the sequence set cannot be fully retrieved after *max_attempts* tries, returns False. Successfully downloaded elements are dumped into *out_seqs*, as a fasta file, and *out_taxs*, as a table with columns *Accession* and *TaxID*

#### BOLDFetcher
NOTE: BOLD sequences include gaps ('-') that BLAST does not accept. Remove them in this step
**WIP**
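
Since this class is still a work in progress, the following is only a sketch of how the gap removal mentioned in the note might be done. The function name `strip_gaps` and the line-based input are assumptions; header lines are left untouched because identifiers may legitimately contain a `-`.

```python
def strip_gaps(fasta_lines):
    """Remove '-' gap characters from sequence lines, leaving headers intact."""
    cleaned = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: keep as is
            cleaned.append(line)
        else:
            cleaned.append(line.replace("-", ""))
    return cleaned
```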

### Taxonomist
This module retrieves the taxonomic data for each record downloaded by a *Fetcher* instance. The taxonomic data consists of a scientific name and its numeric ID for each specified taxonomic rank. The module uses a limited set of ranks by default (*Phylum*, *Class*, *Order*, *Family*, *Genus* and *Species*), but this can be adjusted by the user. The taxonomic data is stored into CSV files following the convention *\<taxon>_\<marker>_\<database>.tax*

#### Functions
* **detect_taxidfiles(file_list)** Analogous to *lister.detect_summ*. May define a module for shared functions later
* **tax_slicer(tax_list, chunksize = 500)** Analogous to *fetcher.acc_slicer*
* **extract_tax_data(record)** Extract taxonomic data from a downloaded NCBI record. Returns a dictionary with Key:Values of the form *\<rank>:\<Scientific name>, \<rank>\_id:\<Taxon ID>, ...*
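
As an illustration of the dictionary returned by *extract_tax_data*, the sketch below works on a plain dictionary shaped like a parsed NCBI taxonomy record, assuming the keys `TaxId`, `ScientificName`, `Rank` and `LineageEx`. The function name and the explicit `ranks` argument are hypothetical additions for the example.

```python
def extract_tax_data_sketch(record, ranks):
    """Return {rank: scientific name, rank_id: taxon ID} for the requested ranks."""
    wanted = {rank.lower() for rank in ranks}
    # The record's own entry is appended to its lineage so the lowest rank is included
    lineage = list(record.get("LineageEx", []))
    lineage.append({key: record[key] for key in ("TaxId", "ScientificName", "Rank")})
    tax_data = {}
    for level in lineage:
        rank = level["Rank"].lower()
        if rank in wanted:
            tax_data[rank] = level["ScientificName"]
            tax_data[f"{rank}_id"] = level["TaxId"]
    return tax_data
```

Ranks missing from the lineage are simply absent from the result, which leaves the gap filling to a later step.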

#### Taxonomist
**class Taxonomist(out_dir)**

This class directs taxonomic data retrieval for downloaded records. The class uses *taxtmp* files generated by a *Fetcher* instance to retrieve records.
##### Parameters
* **out_dir** Target directory for the generated taxonomy tables

##### Attributes
* **taxid_files = {}** Dictionary containing the *taxtmp* files
* **ranks = []** List of taxonomic ranks to retrieve. The ranks used by default are *Phylum*, *Class*, *Order*, *Family*, *Genus* and *Species*
* **out_dir**
* **out_files = {}** Dictionary containing generated filenames per database

##### Methods
* **get_taxidfiles(taxid_files)** Retrieve the *taxtmp* to utilize. DEPRECATED
* **check_taxidfiles()** Verify *taxtmp* files are available and usable. Moved inside *taxing*
* **set_ranks(ranklist = \['phylum', 'class', 'order', 'family', 'genus', 'species'\])** Resets the list of taxonomic ranks to utilize
* **taxing(taxid_files, chunksize = 500, max_attempts = 3)** Retrieve taxonomic data from the given *taxid_files*. The first argument consists of a dictionary with *database*:*taxID file* items. The arguments *chunksize* and *max_attempts* are used when retrieving data from the NCBI database. Generates the corresponding *Taxer* instances and directs them to construct the taxonomic tables by calling their *taxing* methods. Paths to output files are stored in *out_files*

#### Taxer
**class Taxer()**

Abstract class used as a template for the tools specific for each database. Defines shared methods

##### Methods
* **generate_outfiles()** Generates the filename for the results
* **fill_blanks()** Fills missing values in the retrieved taxonomies. The criterion used is to fill a missing value at a given rank with the closest known value among its parent taxa.
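
The filling criterion can be sketched for a single record as follows. This assumes the ranks are ordered from highest to lowest and that missing values are represented as `None`; the function name is hypothetical and the real method operates on the class's tables.

```python
def fill_blanks_sketch(row, ranks):
    """Fill missing ranks with the closest known parent value.

    row maps rank -> value (None when missing); ranks is ordered from the
    highest rank to the lowest.
    """
    filled = dict(row)
    last_known = None
    for rank in ranks:
        if filled.get(rank) is None:
            # Stays None when no parent rank is known yet
            filled[rank] = last_known
        else:
            last_known = filled[rank]
    return filled
```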

#### TaxonomistNCBI(Taxer)
**class TaxonomistNCBI(taxid_file, ranks, out_dir)**

This class handles retrieving taxonomic data for NCBI records. To do so, it connects to the NCBI database using the Entrez API and downloads the full taxonomy corresponding to each record's taxonomic ID. It then selects the specified ranks and generates a table with the scientific name and ID for each taxon.

##### Parameters
* **taxid_file** *taxtmp* file generated for a set of NCBI records
* **ranks** Taxonomic ranks to conserve from the retrieved taxonomies
* **out_dir** Output directory

##### Attributes
* **taxid_file**
* **ranks**
* **out_dir**
* **tax_out** Output filename
* **taxid_list** List of taxonomic IDs extracted from *taxid_file*. Stored as a pandas series with accession codes as indexes
* **taxid_reverse** Reversed *taxid_list* (taxon Ids as indexes, and accession codes as values)
* **uniq_taxs** List of unique taxonomic IDs present in *taxid_list*
* **tax_table** Output table, has a column for the Scientific name and one for the numeric ID for each taxonomic rank. The table index is made up by the accession codes.
* **tax_tab0** Taxonomic table for each unique taxon present in *taxid_list*. Used to build *tax_table*
* **failed** List of records that failed to download

##### Methods
* **read_taxid_file()** Extracts the taxonomic IDs from *taxid_file*, stores them into *taxid_list* as a pandas Series, using the accession codes as index
* **make_tax_tables()** Pregenerates *tax_table* and *tax_tab0*
* **dl_tax_records(tax_list, chunksize = 500)** Attempts to retrieve the records contained in *tax_list* from NCBI. *tax_list* is split into chunks of size *chunksize*. If a chunk fails to download it is stored to *failed* for later retries. Successfully downloaded records are used to update *tax_tab0*
* **retry_dl(max_attempts = 3)** If any records failed to download, performs further attempts. The method keeps trying until there are no failed records or the number of attempts made exceeds *max_attempts*
* **__update_guide(taxids, records)** Used internally by *dl_tax_records* to update *tax_tab0* with a batch of successfully downloaded records
* **update_tables()** Uses *tax_tab0* to fill *tax_table*
* **taxing(chunksize = 500, max_attempts = 3)** Directs the process of retrieving the taxonomic data, filling missing values and building *tax_table*

#### TaxonomistBOLD(Taxer)
**class TaxonomistBOLD(taxid_file, ranks, out_dir)**

This class handles retrieving taxonomic data for BOLD records. As the data retrieved by a *Surveyor* instance already contains taxonomic data for each record, this class only has to extract the already present data and prepare it for merger with the NCBI records.

##### Parameters
* **taxid_file** *taxtmp* file generated for a set of BOLD records
* **ranks** Taxonomic ranks to conserve from the retrieved taxonomies
* **out_dir** Output directory

##### Attributes
* **taxid_file**
* **ranks**
* **out_dir**
* **tax_out** Output filename
* **bold_tab** Dataframe containing the records retrieved from the BOLD database
* **marker_vars** **WIP**
* **tax_tab0** Output table, containing the corresponding rank and rank_ID columns extracted from *bold_tab*. Stored as *tax_tab0* to maintain compatibility with the *fill_blanks* method

##### Methods
* **read_taxid_file()** Extracts the taxonomic IDs from *taxid_file*, stores them into *taxid_list* as a pandas Series, using the accession codes as index
* **__set_marker_vars()** **WIP**
* **get_tax_tab()** Used to extract and rename the necessary columns from *bold_tab* and build *tax_tab0*
* **taxing(chunksize = None, max_attempts = None)** Directs the process of retrieving the taxonomic data, filling missing values and building *tax_table*. Arguments *chunksize* and *max_attempts* are kept for compatibility with *Taxonomist.taxing* method

### Merger
This module unifies the downloaded records and taxonomic data for all databases included in the search

#### Functions
* **detect_files(file_list)** Behaves like *lister.detect_summ*. Given the list *file_list*, identifies the database to which each file belongs. Used when input files are passed as a list instead of a dictionary
* **flatten_taxtab(tax_tab)** Generates a two column table from the *tax_tab*, containing the taxID, rank and parent taxID of each taxon. The attribute *tax name* is kept as the index. Assumes the tax tab has 2\**n_taxes* columns (\<*rank*>, \<*rank*>_id)
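
One way to picture the flattening step is shown below. The sketch works on a list of dictionaries instead of a table, assumes ranks are ordered from highest to lowest, and assigns `None` as the parent of the highest rank; the function name and the output layout are illustrative assumptions.

```python
def flatten_taxtab_sketch(rows, ranks):
    """Build {tax name: {"taxid", "rank", "parent_taxid"}} from wide rows.

    Each row holds a <rank> and a <rank>_id entry per taxonomic rank.
    """
    guide = {}
    for row in rows:
        parent = None
        for rank in ranks:
            name, taxid = row.get(rank), row.get(f"{rank}_id")
            if name is None:
                continue
            # The first occurrence of a taxon defines its guide entry
            guide.setdefault(name, {"taxid": taxid, "rank": rank, "parent_taxid": parent})
            parent = taxid
    return guide
```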

#### Merger
**class Merger(out_dir)**

This class manages the union of sequence and taxonomy files
##### Parameters
* **out_dir** Target directory for the generated merged files

##### Attributes
* **out_dir**
* **seq_out** File name for the merged sequence file
* **acc_out** File name for the merged accession list
* **tax_out** File name for the merged taxonomy table
* **taxguide_out** File name for the generated TaxID guide

##### Methods
* **get_files(seqfiles, taxfiles)** Locates the given *seqfiles* and *taxfiles*. Should be given as \<*database*>:\<*filename*> dicts. Calls *detect_files* if necessary
* **generate_outfiles()** Generates output file names for the generated files. <ins> NOTE: add argument for manual naming</ins>
* **merge_seqs()** Merge the given sequence files and build the corresponding combined accession list. Store generated files to *seq_out* and *acc_out*
* **merge_taxons()** Create a *MergerTax* instance to direct the union of the taxonomic data
* **merge(seqfiles, taxfiles)** Main function. Chains all the steps of sequence and taxonomy merger
* **merge_from_fasta(seqfile, taxfile)** Used when a fasta file was provided, generate the corresponding accession list and TaxID guide. Store files to *acc_out* and *taxguide_out*

#### MergerTax
**class MergerTax(tax_files)**

This class handles the merging of taxonomy files, including the unification of taxonomic numeric IDs
##### Parameters
* **tax_files** Dictionary containing \<*database*>:\<*filename*> for each taxonomy table

##### Attributes
* **tax_files**
* **NCBI** Boolean tag, indicating if *NCBI* is present among the given databases
* **tax_tabs** Dictionary containing the taxonomy table for each database
* **tax_guides** Dictionary containing the TaxID guide table for each database
* **guide_tab** Unified TaxID table

##### Methods
* **load_files()** Loads the files named in *tax_files* and stores the tables in *tax_tabs*
* **build_tax_guides()** Calls *flatten_taxtab* to build the corresponding TaxID guide for each table in *tax_tabs*
* **unify_taxids()** Unifies the taxonomic codes used by the different taxonomy tables. If *NCBI* is true, prioritizes the NCBI codes
* **merge_taxons(tax_out, taxguide_out)** Main function. Unifies taxonomic codes, merges taxonomy tables and builds the TaxID guide. Stores files to *tax_out* and *taxguide_out*

**NOTE: since the taxID guide maps each taxon to its numeric ID, the merged tax table should contain ONLY the taxonomic IDs. Modify the functions that generate the final taxonomy tables, remove columns for the taxon names, keep only taxIDs**

### Director
This module chains every step in the process of constructing the sequence database. It can be used individually or called from the director script.

#### Functions
* **make_dirs(base_dir)** Generates the necessary subdirectories in *base_dir* to contain the generated files. Subdirectory names are *data*, *tmp* and *warnings*
* **fasta_name(fasta)** Extracts the file name from a fasta file, used to name subsequent files.
* **move_file(file, dest, mv = False)** Relocates a given *file* to a given *dest*. If mv is True, the file is deleted from its original location.
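
A minimal sketch of the directory setup and file relocation helpers, using only the standard library. The `_sketch` names and the return values are assumptions made for the example; the behaviour follows the descriptions above (copy by default, move when *mv* is True).

```python
import os
import shutil


def make_dirs_sketch(base_dir):
    """Create the data, tmp and warnings subdirectories inside base_dir."""
    paths = {}
    for name in ("data", "tmp", "warnings"):
        paths[name] = os.path.join(base_dir, name)
        os.makedirs(paths[name], exist_ok=True)
    return paths


def move_file_sketch(file, dest, mv=False):
    """Copy file into the dest directory, or move it when mv is True."""
    target = os.path.join(dest, os.path.basename(file))
    if mv:
        shutil.move(file, target)
    else:
        shutil.copy(file, target)
    return target
```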

#### Director
**class Director(out_dir, tmp_dir, warn_dir)**

This class handles the process of generating a sequence database, from the initial survey to the final merger of retrieved records.

##### Parameters
* **out_dir** Directory for the output sequence, accession, taxonomy and taxonomy guide files
* **tmp_dir** Directory for the temporary files generated along the process
* **warn_dir** Directory for the warning files generated along the process

##### Attributes
* **out_dir**
* **tmp_dir**
* **warn_dir**
* **warn_handler**
* **log_handler**
* **Workers** The workers are instances of the classes needed to perform every step of the database construction
  * **surveyor**
  * **lister**
  * **fetcher**
  * **taxonomist**
  * **merger**
* **seq_file** Path to generated sequence file
* **acc_file** Path to generated accession file
* **tax_file** Path to generated taxonomy file
* **taxguide_file** Path to generated taxonomy guide file

##### Methods
* **clear_tmp()** This method clears the temporary directory. Invoked manually or by the director script
* **set_ranks(ranks)** This method resets the *ranks* attributes in the *taxonomist* worker. Used to customize the taxonomic ranks included in the taxonomy table. NOTE: The ranks used are case insensitive, but must be valid taxonomic ranks.
* **direct_fasta(fasta_file, chunksize = 500, max_attempts = 3, mv = False)** Directs the database construction from a pre-generated fasta file. Retrieves taxonomic records from the NCBI database (BOLD not implemented). Arguments *chunksize* and *max_attempts* determine the number of records to retrieve per pass and the number of attempts per chunk. If the *mv* argument is enabled, the fasta file is moved to the output directory rather than copied.
* **direct(taxon, marker, databases, chunksize = 500, max_attempts = 3)** Constructs a sequence database for the given *taxon*-*marker* pair, retrieving records from the given *databases* (BOLD and/or NCBI). Arguments *chunksize* and *max_attempts* determine the number of records to retrieve per pass and the number of attempts per chunk.
* **get_out_files()** Update the output file path attributes.