So I've got an idea I'm going to test here.  

MICA works in two steps. It first compares the query sequences to a clustered (coarse) database, then builds a customized database from the sequences in the clusters that were hit and uses it for a fine-scale BLAST search. If the query contains similar sequences, I'd expect the whole workflow to run faster than it would for a query of unrelated sequences: similar queries hit the same clusters, so the final BLAST database is smaller.
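
To make the expectation concrete, here is a toy model of the second step. It is only a sketch of the reasoning above, not MICA's actual implementation: the cluster names, members, and the `fine_db_size` helper are all invented for illustration, and it assumes the fine database is simply the union of the clusters hit by any query.

```python
# Toy model: the fine database is the union of the coarse clusters hit by
# any query, so a cluster shared by several queries is only counted once.
# Cluster names and members are invented for illustration.
coarse_clusters = {
    "clusterA": {"protA1", "protA2", "protA3"},
    "clusterB": {"protB1", "protB2"},
    "clusterC": {"protC1", "protC2", "protC3", "protC4"},
}


def fine_db_size(query_hits, clusters):
    """Return the number of distinct sequences in the fine database.

    query_hits maps each query name to the list of coarse clusters it hit.
    """
    fine_db = set()
    for hit_clusters in query_hits.values():
        for cluster_id in hit_clusters:
            fine_db.update(clusters[cluster_id])
    return len(fine_db)


# Three similar queries all hit the same coarse cluster.
similar_hits = {"q1": ["clusterA"], "q2": ["clusterA"], "q3": ["clusterA"]}
# Three unrelated queries each hit a different coarse cluster.
unrelated_hits = {"q1": ["clusterA"], "q2": ["clusterB"], "q3": ["clusterC"]}

print(fine_db_size(similar_hits, coarse_clusters))
print(fine_db_size(unrelated_hits, coarse_clusters))
```

In this toy case the similar queries produce a fine database a third the size of the one produced by the unrelated queries, which is the effect the tests below try to measure.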

Tests:
Time MICA and BLAST runs on a random set of ORFs, compared to a cluster of similar ORFs.
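
The runs in this notebook are timed with the shell's `time`. As an alternative, a small Python helper could record wall-clock time directly. This is a sketch only: `time_command` is a hypothetical helper, and the `echo` command stands in for a real `mica-search` or `blastp` command string.

```python
import subprocess
import time


def time_command(cmd):
    """Run a shell command and return (exit_code, elapsed_seconds)."""
    start = time.time()
    exit_code = subprocess.call(cmd, shell=True)
    return exit_code, time.time() - start


# Stand-in command; here it would be a mica-search or blastp command string.
exit_code, elapsed = time_command("echo placeholder")
print("exit code {0}, {1:.2f} s".format(exit_code, elapsed))
```

Unlike `time`, this reports only wall-clock time, not user or system CPU time.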

**Constructing the query files to test**

***Test SAGs***  
*4 SAR11 SAGs that form their own genomic cluster according to MASH*
* AG-359-O18
* AG-905-E11
* AG-896-D15
* AG-904-K13

In [15]:
sags = '''
AG-359-O18
AG-905-E11
AG-896-D15
AG-904-K13
'''.split()

Create one large FASTA file containing the ORFs from all four SAGs:

In [19]:
# __future__ imports have to come before any other statement
from __future__ import print_function

import argparse
import contextlib
import fileinput
import gzip
import itertools
import os
import os.path as op
import shutil
import subprocess
import sys
import tempfile
import time
from collections import defaultdict
from distutils.spawn import find_executable

import six
from scgc.fastx import read_fasta

@contextlib.contextmanager
def file_transaction(*rollback_files):
    """
    Wrap file generation in a transaction, moving to output if finishes.
    """
    safe_names, orig_names = _flatten_plus_safe(rollback_files)
    # remove any half-finished transactions
    remove_files(safe_names)
    try:
        if len(safe_names) == 1:
            yield safe_names[0]
        else:
            yield tuple(safe_names)
    # failure -- delete any temporary files
    except:
        remove_files(safe_names)
        remove_tmpdirs(safe_names)
        raise
    # worked -- move the temporary files to permanent location
    else:
        for safe, orig in zip(safe_names, orig_names):
            if os.path.exists(safe):
                shutil.move(safe, orig)
        remove_tmpdirs(safe_names)


def remove_tmpdirs(fnames):
    for x in fnames:
        xdir = os.path.dirname(os.path.abspath(x))
        if xdir and os.path.exists(xdir):
            shutil.rmtree(xdir, ignore_errors=True)


def remove_files(fnames):
    for x in fnames:
        if x and os.path.exists(x):
            if os.path.isfile(x):
                os.remove(x)
            elif os.path.isdir(x):
                shutil.rmtree(x, ignore_errors=True)


def _flatten_plus_safe(rollback_files):
    """
    Flatten names of files and create temporary file names.
    """
    tx_files, orig_files = [], []
    for fnames in rollback_files:
        if isinstance(fnames, six.string_types):
            fnames = [fnames]
        for fname in fnames:
            basedir = safe_makedir(os.path.dirname(fname))
            tmpdir = safe_makedir(tempfile.mkdtemp(dir=basedir))
            tx_file = os.path.join(tmpdir, os.path.basename(fname))
            tx_files.append(tx_file)
            orig_files.append(fname)
    return tx_files, orig_files


def safe_makedir(dname):
    """
    Make a directory if it doesn't exist, handling concurrent race conditions.
    """
    if not dname:
        return dname
    num_tries = 0
    max_tries = 5
    while not os.path.exists(dname):
        try:
            os.makedirs(dname)
        except OSError:
            if num_tries > max_tries:
                raise
            num_tries += 1
            time.sleep(2)
    return dname


def file_exists(fnames):
    """
    Check if a file or files exist and are non-empty.

    parameters
        fnames : file path as string or paths as list; if list, all must exist

    returns
        boolean
    """
    if isinstance(fnames, six.string_types):
        fnames = [fnames]
    for f in fnames:
        if not os.path.exists(f) or os.path.getsize(f) == 0:
            return False
    return True


def check_dependencies(executables):
    exes = []
    for exe in executables:
        if not find_executable(exe):
            exes.append(exe)
    if len(exes) > 0:
        for exe in exes:
            print("`%s` not found in PATH." % exe)
        sys.exit(1)


def name_from_path(path):
    file, ext = os.path.splitext(os.path.basename(path))
    if ext == ".gz":
        file, ext = os.path.splitext(file)
    return file

def readfa(fh):
    for header, group in itertools.groupby(fh, lambda line: line[0] == '>'):
        if header:
            line = next(group)
            name = line[1:].strip()
        else:
            seq = ''.join(line.strip() for line in group)
            yield name, seq
            
def format_fasta_record(name, seq, wrap=100):
    """Fasta __str__ method.

    Convert fasta name and sequence into wrapped fasta format.

    Args:
        name (str): name of the record
        seq (str): sequence of the record
        wrap (int): length of sequence per line

    Returns:
        str: the FASTA-formatted record, without a trailing newline

    >>> format_fasta_record("seq1", "ACTG")
    '>seq1\\nACTG'
    """
    record = ">{name}\n".format(name=name)
    if wrap:
        for i in range(0, len(seq), wrap):
            record += seq[i:i+wrap] + "\n"
    else:
        record += seq + "\n"
    return record.strip()

def run_cd_hit(input_fa, output_fa, c=0.9, G=1, b=20, M=800,
    T=1, n=5, l=10, t=2, d=20, s=0.0, S=999999, aL=0.0, AL=99999999, aS=0.0,
    AS=99999999, A=0, uL=1.0, uS=1.0, U=99999999, g=1, sc=0, sf=0):
    """Run CD-HIT to cluster input FASTA.

    Args:
        input_fa (str): File path to fasta.
        output_fa (str): File path of output fasta.
        c (Optional[float]): sequence identity threshold, default 0.9
            this is cd-hit's default "global sequence identity", calculated as:
            number of identical amino acids in alignment
            divided by the full length of the shorter sequence
        G (Optional[int]): use global sequence identity, default 1
            if set to 0, then use local sequence identity, calculated as:
            number of identical amino acids in alignment
            divided by the length of the alignment
            NOTE!!! don't use -G 0 unless you use alignment coverage controls
            see options -aL, -AL, -aS, -AS
        b (Optional[int]): band_width of alignment, default 20
        M (Optional[int]): memory limit (in MB) for the program, default 800; 0 for unlimited
        T (Optional[int]): number of threads, default 1; with 0, all CPUs will be used
        n (Optional[int]): word_length, default 5, see user's guide for choosing it
        l (Optional[int]): length of throw_away_sequences, default 10
        t (Optional[int]): tolerance for redundance, default 2
        d (Optional[int]): length of description in .clstr file, default 20
            if set to 0, it takes the fasta defline and stops at first space
        s (Optional[float]): length difference cutoff, default 0.0
            if set to 0.9, the shorter sequences need to be
            at least 90% length of the representative of the cluster
        S (Optional[int]): length difference cutoff in amino acid, default 999999
            if set to 60, the length difference between the shorter sequences
            and the representative of the cluster can not be bigger than 60
        aL (Optional[float]): alignment coverage for the longer sequence, default 0.0
            if set to 0.9, the alignment must cover 90% of the sequence
        AL (Optional[int]): alignment coverage control for the longer sequence, default 99999999
            if set to 60, and the length of the sequence is 400,
            then the alignment must be >= 340 (400-60) residues
        aS (Optional[float]): alignment coverage for the shorter sequence, default 0.0
            if set to 0.9, the alignment must cover 90% of the sequence
        AS (Optional[int]): alignment coverage control for the shorter sequence, default 99999999
            if set to 60, and the length of the sequence is 400,
            then the alignment must be >= 340 (400-60) residues
        A (Optional[int]): minimal alignment coverage control for both sequences, default 0
            alignment must cover >= this value for both sequences
        uL (Optional[float]): maximum unmatched percentage for the longer sequence, default 1.0
            if set to 0.1, the unmatched region (excluding leading and tailing gaps)
            must not be more than 10% of the sequence
        uS (Optional[float]): maximum unmatched percentage for the shorter sequence, default 1.0
            if set to 0.1, the unmatched region (excluding leading and tailing gaps)
            must not be more than 10% of the sequence
        U (Optional[int]): maximum unmatched length, default 99999999
            if set to 10, the unmatched region (excluding leading and tailing gaps)
            must not be more than 10 bases
        g (Optional[int]): 1 or 0, default 1
            when 0, a sequence is clustered to the first
            cluster that meets the threshold (fast cluster). If set to 1, the program
            will cluster it into the most similar cluster that meets the threshold
            (accurate but slow mode)
            but either 1 or 0 won't change the representatives of final clusters
        sc (Optional[int]): sort clusters by size (number of sequences), default 0, output clusters by decreasing length
            if set to 1, output clusters by decreasing size
        sf (Optional[int]): sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
            if set to 1, output sequences by decreasing cluster size

    Returns:
        list, [file path of output fasta, file path of output cluster definitions]

    """
    output_clstr = "{fa}.clstr".format(fa=output_fa)
    output_files = [output_fa, output_clstr]
    if file_exists(output_files):
        return output_files

    print("Running CD-HIT on {fa}".format(fa=input_fa), file=sys.stderr)

    contig_name_map = {}
    tmp_fasta = "{fa}.rename.tmp".format(fa=input_fa)
    with open(input_fa) as f_in, open(tmp_fasta, "w") as f_out:
        for i, (name, seq) in enumerate(readfa(f_in), start=1):
            contig_name_map["%d" % i] = name
            print(format_fasta_record(i, seq), file=f_out)

    with file_transaction(output_files) as tx_out_files:
        cmd = ("cd-hit -i {input_fasta} -o {output_fasta} -c {c} "
                "-G {G} -b {b} -M {M} -T {T} -n {n} -l {l} -t {t} "
                "-d {d} -s {s} -S {S} -aL {aL} -AL {AL} -aS {aS} "
                "-AS {AS} -A {A} -uL {uL} -uS {uS} -U {U} "
                "-p 1 -g {g} -sc {sc} -sf {sf}").format(input_fasta=tmp_fasta,
                                                        output_fasta=tx_out_files[0],
                                                        **locals())
        subprocess.check_call(cmd, shell=True)
        # copy the clstr output to its temp file location; let file_transaction clean up
        shutil.copyfile("{fa}.clstr".format(fa=tx_out_files[0]), tx_out_files[1])

        # edit the output files in place to restore the original names;
        # this also changes the format of the cluster file

        # change the names in the output FASTA file back to the originals
        for line in fileinput.input(tx_out_files[0], inplace=True):
            line = line.strip()
            if line.startswith(">"):
                name = contig_name_map[line.strip(">")]
                print(">{name}".format(name=name))
            else:
                print(line)

        # change the contig names in the cluster file
        for line in fileinput.input(tx_out_files[1], inplace=True):
            line = line.strip()
            if not line.startswith(">"):
                # changes:
                # 1	382aa, >6... at 1:382:1:382/100.00%
                # to just the original contig name
                if "*" in line:
                    print('{}*'.format(contig_name_map[line.partition(">")[-1].partition("...")[0]]))
                else:
                    print(contig_name_map[line.partition(">")[-1].partition("...")[0]])
            else:
                # this is the cluster ID
                print(line)

    if file_exists(tmp_fasta):
        os.remove(tmp_fasta)

    return output_files

In [21]:
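with open("./outputs/s11_allorfs.faa", "w") as oh:
    for s in sags:
        faa_path = "/mnt/scgc/simon/simonsproject/bats248_annotations/faa/{s}.faa".format(s=s)
        # open each input inside a with block so the handle is closed afterwards
        with open(faa_path) as fh:
            for name, seq in read_fasta(fh):
                print(">{name}".format(name=name), file=oh)
                # wrap sequences at 60 residues per line
                for i in range(0, len(seq), 60):
                    print(seq[i:i+60], file=oh)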

In [23]:
[faa, clstr] = run_cd_hit("./outputs/s11_allorfs.faa", "./outputs/s11_allorfs_cdhit_c-7.faa", c=.7)

In [24]:
cluster_map = defaultdict(list)
with open(clstr) as fh:
    for cluster_start, group in itertools.groupby(fh, lambda l: l[0] == '>'):
        members = []
        if not cluster_start:
            for line in group:
                if "*" in line:
                    rep_seq = line.strip().replace("*", "")
                else:
                    members.append(line.strip())
        if len(members) > 0:
            cluster_map[rep_seq] = members

In [43]:
# representatives of clusters with exactly two other members (three ORFs in total)
big_groups = []

for i in cluster_map:
    if len(cluster_map[i]) == 2:
        big_groups.append(i)

In [46]:
to_test = [big_groups[0]] + cluster_map[big_groups[0]]

In [47]:
to_test

['AG-905-E11_00497 Bicarbonate transport system permease protein CmpB',
 'AG-896-D15_00504 Bicarbonate transport system permease protein CmpB',
 'AG-904-K13_00125 Bicarbonate transport system permease protein CmpB']

In [49]:
lengths = []

with open("./outputs/sim_transport_prots.fasta", "w") as oh:
    for name, seq in read_fasta(open("./outputs/s11_allorfs.faa")):
        if name in to_test:
            print(">{name}".format(name=name), file=oh)
            for i in range(0, len(seq),60):
                print(seq[i:i+60], file=oh)
            lengths.append(len(seq))

In [50]:
lengths

[641, 641, 641]

In [60]:
count = 0
odd_lens = []
odd_names = []

# pick three ORFs within 5 residues of length 641 that are not in to_test
# and whose annotations differ from one another
with open("./outputs/sim_sizes.fasta", "w") as oh:
    for name, seq in read_fasta(open("./outputs/s11_allorfs.faa")):
        if count == 3:
            break
        # annotation text: everything after the ORF identifier
        desc = " ".join(name.split(" ")[1:])
        if abs(len(seq) - 641) < 5 and name not in to_test and desc not in odd_names:
            count += 1
            print(">{name}".format(name=name), file=oh)
            for i in range(0, len(seq), 60):
                print(seq[i:i+60], file=oh)
            odd_lens.append(len(seq))
            odd_names.append(desc)

In [79]:
def run_mica(fasta, out, db='/mnt/scgc/simon/databases/mica/nr-20150620-mica', num_alignments=10,
           evalue=0.001,
           threads=20, fields = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send", "evalue",
                  "bitscore", "sallseqid", "score", "nident", "positive",
                  "gaps", "ppos", "qframe", "sframe", "qseq", "sseq", "qlen",
                  "slen", "salltitles"]):
    """Build, print and return the mica-search command string.

    The command is not executed here; run it separately (e.g. with `!time {cmd}`).
    """
    cmd = ("mica-search --p='{threads}' --blastp 'blastp' {db} {query} "
                   "--blast-args -outfmt '6 {fields}' "
                   "-num_alignments {alignments} -evalue {evalue} -out {out}").format(db=db,
                                                      query=fasta,
                                                      fields=" ".join(fields),
                                                      threads=threads,
                                                      alignments=num_alignments,
                                                      evalue=evalue,
                                                      out=out)
    print(cmd)
    return cmd

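def run_blast(fasta, out, db='nr', num_alignments=10,
           evalue=0.001,
           threads=20, fields = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send", "evalue",
                  "bitscore", "sallseqid", "score", "nident", "positive",
                  "gaps", "ppos", "qframe", "sframe", "qseq", "sseq", "qlen",
                  "slen", "salltitles"]):
    """Build, print and return the blastp command string.

    The command is not executed here. Results are appended (>>) to `out`,
    so rerunning the command adds to an existing file.
    """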
    cmd = ("blastp -db {db} -query {query} -outfmt "
                   "'6 {fields}' "
                   "-num_threads {threads} "
                   "-num_alignments {alignments} "
                   "-evalue {evalue} >> {out}").format(db=db,
                                                      query=fasta,
                                                      fields=" ".join(fields),
                                                      threads=threads,
                                                      alignments=num_alignments,
                                                      evalue=evalue,
                                                      out=out)
    print(cmd)
    return cmd

In [89]:
cmd = run_mica("./outputs/sim_sizes.fasta", "./outputs/sim_sizes_mica.out")

mica-search --p='20' --blastp 'blastp' /mnt/scgc/simon/databases/mica/nr-20150620-mica ./outputs/sim_sizes.fasta --blast-args -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_alignments 10 -evalue 0.001 -out ./outputs/sim_sizes_mica.out


In [90]:
!time {cmd}

Opening database in /mnt/scgc/simon/databases/mica/nr-20150620-mica...
	Opening compressed database...
	Done opening compressed database.
	Opening coarse database...
	Done opening coarse database.
Done opening database in /mnt/scgc/simon/databases/mica/nr-20150620-mica.

Blasting query on coarse database...
blastp -db /mnt/scgc/simon/databases/mica/nr-20150620-mica/blastdb-coarse -num_threads 20 -outfmt 5 -dbsize 24387073819
Decompressing blast hits...
Building fine BLAST database...
Created temporary fine BLAST database in /tmp/mica-fine-search-db652426135
makeblastdb -dbtype prot -title blastdb-fine -in - -out /tmp/mica-fine-search-db652426135/blastdb-fine
Blasting query on fine database...
blastp -db /tmp/mica-fine-search-db652426135/blastdb-fine -dbsize 24387073819 -num_threads 20 -outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles -num_alignments 10 -ev

In [91]:
cmd2 = run_mica("./outputs/sim_transport_prots.fasta", "./outputs/sim_transport_prots.out")

mica-search --p='20' --blastp 'blastp' /mnt/scgc/simon/databases/mica/nr-20150620-mica ./outputs/sim_transport_prots.fasta --blast-args -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_alignments 10 -evalue 0.001 -out ./outputs/sim_transport_prots.out


In [92]:
!time {cmd2}

Opening database in /mnt/scgc/simon/databases/mica/nr-20150620-mica...
	Opening compressed database...
	Done opening compressed database.
	Opening coarse database...
	Done opening coarse database.
Done opening database in /mnt/scgc/simon/databases/mica/nr-20150620-mica.

Blasting query on coarse database...
blastp -db /mnt/scgc/simon/databases/mica/nr-20150620-mica/blastdb-coarse -num_threads 20 -outfmt 5 -dbsize 24387073819
Decompressing blast hits...
Building fine BLAST database...
Created temporary fine BLAST database in /tmp/mica-fine-search-db654383573
makeblastdb -dbtype prot -title blastdb-fine -in - -out /tmp/mica-fine-search-db654383573/blastdb-fine
Blasting query on fine database...
blastp -db /tmp/mica-fine-search-db654383573/blastdb-fine -dbsize 24387073819 -num_threads 20 -outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles -num_alignments 10 -ev

In [82]:
cmd3 = run_blast("./outputs/sim_sizes.fasta",'./outputs/sim_sizes_blast.out')

blastp -db nr -query ./outputs/sim_sizes.fasta -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_threads 20 -num_alignments 10 -evalue 0.001 >> ./outputs/sim_sizes_blast.out


In [93]:
!time {cmd3}


real	3m56.374s
user	75m41.137s
sys	0m48.145s


In [84]:
cmd4 = run_blast("./outputs/sim_transport_prots.fasta", "./outputs/sim_transport_prots_blast.out")

blastp -db nr -query ./outputs/sim_transport_prots.fasta -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_threads 20 -num_alignments 10 -evalue 0.001 >> ./outputs/sim_transport_prots_blast.out


In [94]:
!time {cmd4}


real	3m35.737s
user	69m1.353s
sys	0m48.473s
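
To put the two BLAST wall-clock times on the same scale, the `real` values above can be converted to seconds. This is a small sketch; `time_to_seconds` is a hypothetical helper that assumes the `XmY.YYYs` format printed by `time`.

```python
import re


def time_to_seconds(stamp):
    """Convert a `time` stamp such as '3m56.374s' to seconds."""
    match = re.match(r"^(\d+)m([\d.]+)s$", stamp)
    if match is None:
        raise ValueError("unrecognized time stamp: {0}".format(stamp))
    return int(match.group(1)) * 60 + float(match.group(2))


# 'real' values copied from the two blastp runs above.
unrelated = time_to_seconds("3m56.374s")  # sim_sizes.fasta
similar = time_to_seconds("3m35.737s")    # sim_transport_prots.fasta

saved = unrelated - similar
print("unrelated:  {0:.3f} s".format(unrelated))
print("similar:    {0:.3f} s".format(similar))
print("difference: {0:.3f} s ({1:.1f}% of the unrelated run)".format(
    saved, 100 * saved / unrelated))
```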


This is pretty interesting. The quickest run was MICA on the three proteins that are similar to each other (actually identical). The MICA run on three unrelated proteins took a little longer, and the plain BLAST searches took longer still, with the less redundant query taking the most time (3m56s versus 3m36s wall-clock).

What this means is that if I decide to split the sequences into groups, I should get better runtimes by placing similar sequences in the same group.
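
One way such a split could look is sketched here, under assumptions: `batch_by_cluster` is a hypothetical helper, the cluster names are invented, and the input has the same shape as the `cluster_map` built from the CD-HIT output (representative name mapped to a list of member names). Singleton ORFs are not in `cluster_map`, so they would have to be added to batches separately.

```python
def batch_by_cluster(cluster_map, max_batch_size):
    """Split sequence names into batches, keeping each cluster together.

    cluster_map maps a representative name to a list of member names.
    A cluster larger than max_batch_size ends up in a batch of its own.
    """
    batches = []
    current = []
    for rep, members in cluster_map.items():
        cluster = [rep] + list(members)
        # Start a new batch if this cluster would overflow the current one.
        if current and len(current) + len(cluster) > max_batch_size:
            batches.append(current)
            current = []
        current.extend(cluster)
    if current:
        batches.append(current)
    return batches


# Invented cluster names, for illustration only.
example_map = {
    "repA": ["memA1", "memA2"],
    "repB": ["memB1"],
    "repC": ["memC1", "memC2", "memC3"],
}
batches = batch_by_cluster(example_map, max_batch_size=5)
for batch in batches:
    print(batch)
```

Each batch could then be written to its own FASTA file and passed to `mica-search` as a separate query.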

In [86]:
!cat ./outputs/sim_sizes.fasta ./outputs/sim_transport_prots.fasta > ./outputs/sim_combo.fasta

In [87]:
cmd5 = run_mica("./outputs/sim_combo.fasta", "./outputs/sim_combo_mica.out")

mica-search --p='20' --blastp 'blastp' /mnt/scgc/simon/databases/mica/nr-20150620-mica ./outputs/sim_combo.fasta --blast-args -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles' -num_alignments 10 -evalue 0.001 -out ./outputs/sim_combo_mica.out


In [88]:
!time {cmd5}

Opening database in /mnt/scgc/simon/databases/mica/nr-20150620-mica...
	Opening compressed database...
	Done opening compressed database.
	Opening coarse database...
	Done opening coarse database.
Done opening database in /mnt/scgc/simon/databases/mica/nr-20150620-mica.

Blasting query on coarse database...
blastp -db /mnt/scgc/simon/databases/mica/nr-20150620-mica/blastdb-coarse -num_threads 20 -outfmt 5 -dbsize 24387073819
Decompressing blast hits...
Building fine BLAST database...
Created temporary fine BLAST database in /tmp/mica-fine-search-db467230881
makeblastdb -dbtype prot -title blastdb-fine -in - -out /tmp/mica-fine-search-db467230881/blastdb-fine
Blasting query on fine database...
blastp -db /tmp/mica-fine-search-db467230881/blastdb-fine -dbsize 24387073819 -num_threads 20 -outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles -num_alignments 10 -ev