# Finding a consensus result for dbCAN

dbCAN is composed of three CAZyme prediction tools: HMMER, Hotpep and DIAMOND. Each tool can predict that a protein sequence contains multiple CAZyme domains/families.

The aim here is to develop a method that finds a consensus result from three lists, each containing the predictions from one prediction tool.

This is not as simple as comparing a to b and then checking whether those results are in c, because there may be results common to b and c that are not in a, and these would be missed.
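
As a small illustration of the problem (the lists here are made up for the example), a chained comparison that starts from a and b loses a value that is shared only by b and c, whereas a pairwise comparison keeps it:

```python
# Hypothetical predictions from three tools (illustrative values only)
a = [1, 2]
b = [2, 3]
c = [2, 3]

# Chained approach: compare a to b, then check those results against c
a_and_b = set(a) & set(b)
chained = a_and_b & set(c)

# Pairwise approach: any value shared by at least two of the lists
pairwise = (set(a) & set(b)) | (set(a) & set(c)) | (set(b) & set(c))

print(sorted(chained))   # [2]
print(sorted(pairwise))  # [2, 3]
```

The value 3 is predicted by two tools, but the chained approach never sees it because it was dropped at the first step.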

In [5]:
!pip3 install numpy



In [6]:
import numpy as np

In [7]:
# Example lists to use for trying to find a consensus result, starting with an easy example
a = [1, 2, 3, 4]
b = [2, 6]
c = [2, 5]

# first try with sets
consensus = set(a) & set(b) & set(c)
consensus

{2}

In [13]:
# try again with multiple consensus results
aa = [1, 2, 3, 4]
bb = [2, 6]
cc = [2, 5, 6]

consensus_1 = set(aa) & set(bb) & set(cc)
print(consensus_1, type(consensus_1))

{2} <class 'set'>


This only finds results that are common to all three lists, which is not necessarily the result we want. Agreement across all three tools is the best result, but if there is none we want results that appear in at least two of the lists.

If one of the lists is a null value (or contains only null values), a check should be added so that it is excluded from the consensus-set check, and a consensus is sought only across the two remaining result lists.
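
A minimal sketch of such a check, assuming null predictions are represented by `None` or a NaN float. The helper names `drop_nulls` and `consensus_ignoring_nulls` are hypothetical and not part of dbCAN. A plain `float("nan")` stands in for `np.nan` here; `np.nan` is itself a Python float, so `math.isnan` treats it the same way. The sketch also shows why the check matters: two lists holding the same NaN object intersect to a set containing NaN rather than an empty set.

```python
import math


def drop_nulls(predictions):
    """Return the predictions as a set with None and NaN values removed."""
    cleaned = set()
    for item in predictions:
        if item is None:
            continue
        if isinstance(item, float) and math.isnan(item):
            continue
        cleaned.add(item)
    return cleaned


def consensus_ignoring_nulls(*prediction_lists):
    """Intersect only the lists that still hold values after nulls are dropped."""
    non_null = [s for s in map(drop_nulls, prediction_lists) if s]
    if len(non_null) < 2:
        return set()  # fewer than two tools made a prediction, so no consensus
    return set.intersection(*non_null)


nan = float("nan")  # stand-in for np.nan (assumption: nulls are NaN floats)

# Two null lists share the same NaN object, so a plain intersection is not empty
print(len(set([nan]) & set([nan])))  # 1

# With nulls dropped, consensus is taken across the two non-null lists only
print(consensus_ignoring_nulls([1, 2], [1], [nan]))  # {1}
```

Returning an empty set when fewer than two tools made a prediction is a design choice for this sketch; a single tool's prediction could instead be reported separately.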

In [12]:
# First check what happens when comparing lists with null values in
a = [np.nan]
b = [2]
c = [np.nan]

aa = [1, 2]
bb = [1]
cc = [np.nan]

consensus = set(a) & set(b) & set(c)
consensus_1 = set(aa) & set(bb) & set(cc)

print(consensus, repr(consensus))
print(consensus_1, repr(consensus_1))
# This produces an empty set

set() set()
set() set()


An empty set is produced when no value is common to all three lists. Checking for an empty set could potentially be an alternative to checking the number in the '#ofTools#' column when looking for a consensus non-CAZyme prediction.

When a consensus is returned, how do we retrieve the specific value, and how do we check whether a set is empty?

In [24]:
# First how to check if set is empty
a = [np.nan]
b = [2]
c = [np.nan]
consensus = set(a) & set(b) & set(c)
print("consensus=", consensus)

if consensus == set():
    print("Empty")
else:
    print("Not", consensus)
    
a = [2, 3]
b = [2]
c = [2]
consensus = set(a) & set(b) & set(c)
print("consensus=", consensus)

if consensus == set():
    print("Empty")
else:
    print("Not", consensus)

consensus= set()
Empty
consensus= {2}
Not {2}


In [27]:
# now to retrieve the value from a set
print(type(consensus))
print(list(consensus))

<class 'set'>
[2]


In [32]:
# what happens if there is no consensus or the consensus is a null value

# no consensus
a = [1, 2]
b = [3, 4]
c = [5, 6]
con = set(a) & set(b) & set(c)
print("no consensus=", con, list(con), len(list(con)), len(con))

# null value consensus
a = [np.nan]
b = [2]
c = [np.nan]
consensus = set(a) & set(b) & set(c)
print("consensus=", consensus, list(consensus), len(list(consensus)), len(consensus))

# when there is a consensus
a = [2, 3]
b = [2]
c = [2]
consensus = set(a) & set(b) & set(c)
print("consensus=", consensus, list(consensus), len(list(consensus)), len(consensus))

no consensus= set() [] 0 0
consensus= set() [] 0 0
consensus= {2} [2] 1 1


The next issue to deal with is **finding the consensus when only two of the tools contain the result**. One approach is a series of comparisons: `if (in a and b) or (in b and c) or (in a and c):`. This could be slow if the lists are very long, although that is not expected with dbCAN output. A more elegant method may exist.

The question is whether three tools agreeing trumps two agreeing: if three tools agree, do we exclude results on which only two agree?

Alternatively, the results can be spread over two columns, one containing results where all three tools agree and the other containing results where only two tools agree.
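
One possible, more compact alternative to the chained comparisons is to count how many tools report each value. This is only a sketch, and the function name `split_by_agreement` is hypothetical. Each list is converted to a set first so that a value repeated within one tool's output is counted once.

```python
from collections import Counter


def split_by_agreement(*prediction_lists):
    """Return (values found in every list, values found in at least two but not all)."""
    counts = Counter()
    for predictions in prediction_lists:
        counts.update(set(predictions))  # count each tool at most once per value

    n_tools = len(prediction_lists)
    in_all = sorted(v for v, n in counts.items() if n == n_tools)
    in_some = sorted(v for v, n in counts.items() if 2 <= n < n_tools)
    return in_all, in_some


a = [1, 2, 3, 4, 6]
b = [2, 6, 4, 5]
c = [2, 5, 6]
print(split_by_agreement(a, b, c))  # ([2, 6], [4, 5])
```

The two returned lists map directly onto the two-column layout: one column for full agreement and one for partial agreement.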

In [51]:
a = [1, 2, 3, 4, 6]
b = [2, 6, 4, 5]
c = [2, 5, 6]

# the result should be
# 2 and 6 are common to all
# 4 and 5 appear in two of three lists

# first retrieve items common to all lists, this builds the list of consensus results
consensus_3 = list(set(a) & set(b) & set(c))
print("common to all=", consensus_3)

# find items that are in two of the three lists:
consensus_2 = list(set(a) & set(b))
consensus_2 += list(set(a) & set(c))
consensus_2 += list(set(b) & set(c))

# remove duplicates from consensus_2
consensus_2 = list(dict.fromkeys(consensus_2))
# remove items in consensus_2 that are also in consensus_3
# (build a new list rather than removing while iterating, which can skip items)
consensus_2 = [item for item in consensus_2 if item not in consensus_3]

print("final consensus, all=", consensus_3, "two=", consensus_2)

common to all= [2, 6]
final consensus, all= [2, 6] two= [4, 5]


In [None]:
def get_consensus(hmmer, hotpep, diamond):
    """Get consensus results across HMMER, Hotpep and DIAMOND.

    Retrieves a list of items common to all three tools, and another list of items common to only two of the tools.

    :param hmmer: list of predictions from HMMER
    :param hotpep: list of predictions from Hotpep
    :param diamond: list of predictions from DIAMOND

    Return two lists: items where all 3 tools agree, and items where only 2 tools agree
    """
    # Retrieve list of items predicted by all three tools
    consensus_3 = list(set(hmmer) & set(hotpep) & set(diamond))
    if len(consensus_3) == 0:
        consensus_3 = [np.nan]

    # Retrieve list of items predicted by two of the tools
    consensus_2 = list(set(hmmer) & set(hotpep))
    consensus_2 += list(set(hmmer) & set(diamond))
    consensus_2 += list(set(hotpep) & set(diamond))

    # remove duplicates and items in consensus for all 3 tools
    # (a new list is built because removing while iterating can skip items)
    consensus_2 = list(dict.fromkeys(consensus_2))
    consensus_2 = [item for item in consensus_2 if item not in consensus_3]

    if len(consensus_2) == 0:
        consensus_2 = [np.nan]

    return consensus_3, consensus_2


import pandas as pd

with open("overview.txt", "r") as fh:
    overview_file = fh.read().splitlines()

rows = []  # one dict of consensus predictions per protein

for line in overview_file[1:]:
    line = line.split("\t")
    if line[-1] == "1":
        continue  # take the same approach as the other tools, do not include if predicted non-CAZyme

    # TODO: parse the HMMER, Hotpep and DIAMOND predictions for this protein into
    # hmmer_df, hotpep_df and diamond_df (not yet implemented in this notebook)

    fam_consensus_3, fam_consensus_2 = get_consensus(hmmer_df["cazy_family"], hotpep_df["cazy_family"], diamond_df["cazy_family"])
    subfam_consensus_3, subfam_consensus_2 = get_consensus(hmmer_df["cazy_subfamily"], hotpep_df["cazy_subfamily"], diamond_df["cazy_subfamily"])

    # each consensus list is stored in a single cell, giving one row per protein
    rows.append({
        "protein_accession": line[0],
        "cazy_family_3": fam_consensus_3,
        "cazy_subfamily_3": subfam_consensus_3,
        "cazy_family_2": fam_consensus_2,
        "cazy_subfamily_2": subfam_consensus_2,
    })

# build the dataframe once at the end; DataFrame.append was removed in pandas 2.0
dbcan_df = pd.DataFrame(
    rows,
    columns=["protein_accession", "cazy_family_3", "cazy_subfamily_3", "cazy_family_2", "cazy_subfamily_2"],
)

dbcan_df