-
Notifications
You must be signed in to change notification settings - Fork 16
Description
Taxon assignment is currently performed by running DIAMOND blastp on contig proteins, finding the LCA and majority vote.
Currently, the blast result simply returns the top 200 database hits which is not ideal if there are many high-scoring matches
That is currently decided here in python:
Autometa/autometa/common/external/diamond.py
Lines 57 to 150 in b1ec370
| def blast( | |
| fasta: str, | |
| database: str, | |
| outfpath: str, | |
| blast_type: str = "blastp", | |
| evalue: float = float("1e-5"), | |
| maxtargetseqs: int = 200, | |
| cpus: int = mp.cpu_count(), | |
| tmpdir: str = None, | |
| force: bool = False, | |
| verbose: bool = False, | |
| ) -> str: | |
| """ | |
| Performs diamond blastp search using query sequence against diamond formatted database | |
| Parameters | |
| ---------- | |
| fasta : str | |
| Path to fasta file having the query sequences. Should be amino acid sequences in case of BLASTP | |
| and nucleotide sequences in case of BLASTX | |
| database : str | |
| Path to diamond formatted database | |
| outfpath : str | |
| Path to output file | |
| blast_type : str, optional | |
| blastp to align protein query sequences against a protein reference database, | |
| blastx to align translated DNA query sequences against a protein reference database, by default 'blastp' | |
| evalue : float, optional | |
| cutoff e-value to count hit as significant, by default float('1e-5') | |
| maxtargetseqs : int, optional | |
| max number of target sequences to retrieve per query by diamond, by default 200 | |
| cpus : int, optional | |
| Number of processors to be used, by default uses all the processors of the system | |
| tmpdir : str, optional | |
| Path to temporary directory. By default, same as the output directory | |
| force : bool, optional | |
| overwrite existing diamond results, by default False | |
| verbose : bool, optional | |
| log progress to terminal, by default False | |
| Returns | |
| ------- | |
| str | |
| Path to BLAST results | |
| Raises | |
| ------ | |
| FileNotFoundError | |
| `fasta` file does not exist | |
| ValueError | |
| provided `blast_type` is not 'blastp' or 'blastx' | |
| subprocess.CalledProcessError | |
| Failed to run blast | |
| """ | |
| if not os.path.exists(fasta): | |
| raise FileNotFoundError(fasta) | |
| if os.path.exists(outfpath) and not force: | |
| if os.path.getsize(outfpath): | |
| if verbose: | |
| logger.warning(f"FileExistsError: {outfpath}. To overwrite use --force") | |
| return outfpath | |
| blast_type = blast_type.lower() | |
| if blast_type not in ["blastp", "blastx"]: | |
| raise ValueError(f"blast_type must be blastp or blastx. Given: {blast_type}") | |
| if verbose: | |
| logger.debug(f"diamond {blast_type} {fasta} against {database}") | |
| cmd = [ | |
| "diamond", | |
| blast_type, | |
| "--query", | |
| fasta, | |
| "--db", | |
| database, | |
| "--evalue", | |
| evalue, | |
| "--max-target-seqs", | |
| maxtargetseqs, | |
| "--threads", | |
| cpus, | |
| "--outfmt", | |
| "6", | |
| "--out", | |
| outfpath, | |
| ] | |
| if tmpdir: | |
| cmd.extend(["--tmpdir", tmpdir]) | |
| # this is done as when cmd is a list each element should be a string | |
| cmd = [str(c) for c in cmd] | |
| if verbose: | |
| logger.debug(f'RunningDiamond: {" ".join(cmd)}') | |
| subprocess.run( | |
| cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True | |
| ) | |
| return outfpath |
And, once merged, here in Nextflow:
Line 42 in 7618ee1
| ext.args = '--evalue 1e-5 --max-hsps 1 --max-target-seqs 200 -b 6 --outfmt 6' |
To investigate:
It looks like DIAMOND can build the database with taxon info and limit to 1 hit per species, which is more ideal.
e.g. build... (note this has been done and the taxon_mapped_nr_diamond.dmnd file exists on the server)
diamond makedb \
--in $DIR_PREFIX/Databases/autometa_databases/nr \
--db $DIR_PREFIX/Databases/autometa_databases/taxon_mapped_nr_diamond \
--taxonmap $DIR_PREFIX/autometa_databases/prot.accession2taxid.FULL.gz \
--taxonnodes $DIR_PREFIX/autometa_databases/nodes.dmp \
--threads 40
then run diamonnd blastp --taxon-k 1 ... ?
--taxon-k maximum number of targets to report per species
Additionally, this could potentially remove need for a taxid lookup step in Autometa by changing DIAMOND blastp's output to --outfmt 6 staxids
Sorta related to #11