Look into using taxon result limited blastp instead of top-n limited

Taxon assignment is currently performed by running DIAMOND blastp on contig proteins, finding the LCA and majority vote. 

Currently, the blast result simply returns the top 200 database hits which is not ideal if there are many high-scoring matches

That is currently decided here in python:
https://github.com/KwanLab/Autometa/blob/b1ec370c44e7accfc917aaad1a685133537fd915/autometa/common/external/diamond.py#L57-L150
And, once merged, here in Nextflow:
https://github.com/KwanLab/Autometa/blob/7618ee18bb31c4de9af81cf2fc093ff2d1ff0ca2/conf/modules.config#L42

To investigate:

It looks like DIAMOND can build the database with taxon info and limit to 1 hit per species, which is more ideal.

e.g. build... (note this has been done and the taxon_mapped_nr_diamond.dmnd file exists on the server)
```
 diamond makedb \
    --in $DIR_PREFIX/Databases/autometa_databases/nr \
    --db $DIR_PREFIX/Databases/autometa_databases/taxon_mapped_nr_diamond \
    --taxonmap $DIR_PREFIX/autometa_databases/prot.accession2taxid.FULL.gz \
    --taxonnodes $DIR_PREFIX/autometa_databases/nodes.dmp \
    --threads 40
```    

then run `diamonnd blastp --taxon-k 1 ...` ? 

`--taxon-k                maximum number of targets to report per species`

Additionally, this could potentially remove need for a taxid lookup step in Autometa by changing DIAMOND blastp's output to  `--outfmt 6 staxids`



---
Sorta related to https://github.com/KwanLab/Autometa/issues/11

	def blast(
	fasta: str,
	database: str,
	outfpath: str,
	blast_type: str = "blastp",
	evalue: float = float("1e-5"),
	maxtargetseqs: int = 200,
	cpus: int = mp.cpu_count(),
	tmpdir: str = None,
	force: bool = False,
	verbose: bool = False,
	) -> str:
	"""
	Performs diamond blastp search using query sequence against diamond formatted database

	Parameters
	----------
	fasta : str
	Path to fasta file having the query sequences. Should be amino acid sequences in case of BLASTP
	and nucleotide sequences in case of BLASTX
	database : str
	Path to diamond formatted database
	outfpath : str
	Path to output file
	blast_type : str, optional
	blastp to align protein query sequences against a protein reference database,
	blastx to align translated DNA query sequences against a protein reference database, by default 'blastp'
	evalue : float, optional
	cutoff e-value to count hit as significant, by default float('1e-5')
	maxtargetseqs : int, optional
	max number of target sequences to retrieve per query by diamond, by default 200
	cpus : int, optional
	Number of processors to be used, by default uses all the processors of the system
	tmpdir : str, optional
	Path to temporary directory. By default, same as the output directory
	force : bool, optional
	overwrite existing diamond results, by default False
	verbose : bool, optional
	log progress to terminal, by default False

	Returns
	-------
	str
	Path to BLAST results

	Raises
	------
	FileNotFoundError
	`fasta` file does not exist
	ValueError
	provided `blast_type` is not 'blastp' or 'blastx'
	subprocess.CalledProcessError
	Failed to run blast
	"""
	if not os.path.exists(fasta):
	raise FileNotFoundError(fasta)
	if os.path.exists(outfpath) and not force:
	if os.path.getsize(outfpath):
	if verbose:
	logger.warning(f"FileExistsError: {outfpath}. To overwrite use --force")
	return outfpath
	blast_type = blast_type.lower()
	if blast_type not in ["blastp", "blastx"]:
	raise ValueError(f"blast_type must be blastp or blastx. Given: {blast_type}")
	if verbose:
	logger.debug(f"diamond {blast_type} {fasta} against {database}")
	cmd = [
	"diamond",
	blast_type,
	"--query",
	fasta,
	"--db",
	database,
	"--evalue",
	evalue,
	"--max-target-seqs",
	maxtargetseqs,
	"--threads",
	cpus,
	"--outfmt",
	"6",
	"--out",
	outfpath,
	]
	if tmpdir:
	cmd.extend(["--tmpdir", tmpdir])
	# this is done as when cmd is a list each element should be a string
	cmd = [str(c) for c in cmd]
	if verbose:
	logger.debug(f'RunningDiamond: {" ".join(cmd)}')
	subprocess.run(
	cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True
	)
	return outfpath

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Look into using taxon result limited blastp instead of top-n limited #322

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Look into using taxon result limited blastp instead of top-n limited #322

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions