Check data validity

Vinh Tran edited this page Jan 24, 2023 · 4 revisions

Normally, the data shipped with fDOG and the data produced by fdog.addTaxon or fdog.addTaxa are ready to use. However, if you add taxa to fDOG manually, you should check their validity by running this command:

fdog.checkData [-h] [-s SEARCHTAXA_DIR] [-c CORETAXA_DIR] [-a ANNOTATION_DIR] [--replace] [--delete] [--concat] [--reblast]

This script will check for:

  • valid folder names (must not contain a pipe "|", spaces, or certain other special characters)
  • valid FASTA files (no long headers, no spaces or tabs, no special characters or digits in the sequences, and each sequence written on a single line)
  • compatibility between BLAST DBs and current version of BLAST tool
  • missing annotations (all taxa present in genome_dir and blast_dir must have annotations in weight_dir)
  • missing or duplicated NCBI taxonomy IDs
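
To make the folder-name and FASTA rules above concrete, here is a minimal Python sketch of such checks. It is not fDOG's own implementation: the function names, the 30-character header limit, and the "letters only" sequence rule are assumptions chosen for illustration, and fdog.checkData may apply different thresholds.

```python
import re

# Assumed limits, for illustration only; fDOG's real rules may differ.
MAX_HEADER_LEN = 30
VALID_SEQ = re.compile(r"^[A-Za-z]+$")  # letters only: no digits, no symbols
BAD_NAME = re.compile(r"[|\s]")         # pipe or any whitespace


def check_folder_name(name):
    """Return True if the folder name contains no pipe or whitespace."""
    return BAD_NAME.search(name) is None


def check_fasta(text):
    """Return a list of (line_number, problem) tuples for a FASTA string."""
    problems = []
    prev_was_seq = False
    for num, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            if len(header) > MAX_HEADER_LEN:
                problems.append((num, "long header"))
            if re.search(r"\s", header):
                problems.append((num, "whitespace in header"))
            prev_was_seq = False
        else:
            # A second sequence line in a row means the sequence is wrapped.
            if prev_was_seq:
                problems.append((num, "multi-line sequence"))
            if not VALID_SEQ.match(line):
                problems.append((num, "invalid character in sequence"))
            prev_was_seq = True
    return problems
```

Calling check_fasta on the contents of a taxon's FASTA file returns an empty list when none of these problems is found, so the result can be used directly as a pass/fail signal.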

If the FASTA files are not in the right format, you can choose how to process them: delete the special characters in the sequences, replace them with "X", or convert multi-line sequences into single-line sequences.
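
The three repairs can be sketched in a few lines of Python. This is an illustrative sketch only, not the code fDOG runs: fix_fasta and its mode parameter are hypothetical names, and the assumption that these repairs correspond to the --delete, --replace and --concat options in the usage line is not confirmed here.

```python
import re


def fix_fasta(text, mode="replace"):
    """Join multi-line sequences and clean up invalid characters.

    mode="replace" turns every non-letter into "X";
    mode="delete" removes it instead.
    """
    if mode not in ("replace", "delete"):
        raise ValueError("mode must be 'replace' or 'delete'")
    fill = "X" if mode == "replace" else ""

    records = []  # each entry is [header, sequence]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            # Append wrapped lines to the current record's sequence.
            records[-1][1] += line

    out = []
    for header, seq in records:
        out.append(header)
        out.append(re.sub(r"[^A-Za-z]", fill, seq))
    return "\n".join(out) + "\n"
```

Replacing with "X" keeps the sequence length and residue positions unchanged, whereas deleting shortens the sequence, so the choice matters if positions are referenced elsewhere.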