# Parsing the output from CAZyme prediction tools

The desired output is a Pandas dataframe with the following columns:
- Protein ID (unique identifier)
- Predicted CAZy family
- Predicted CAZy subfamily

These columns will be standard across all the dataframes and have the same names.

Each prediction tool will also have additional columns specific to its output.

In [1]:
!pip3 install pandas
!pip3 install numpy



In [2]:
import pandas as pd
import numpy as np

## Parsing CUPP

### Understanding the output

The output FASTA file and log file contain the same data, but present it in different formats.
If output data is missing because it was not calculated, nothing is written in its place. This can make it difficult to determine exactly which data CUPP has produced.

#### FASTA File
The format of the data is as follows for each protein
```
> accession | raw_function | best_function | additional_info | subfam \n seq
```

For example:
```
>QIX00022.1 AMS68_005539|&|AA1:1.10.3.2|AA1:1.10.3.2|AA1:6.1(20.3,124..357)|AA1_3
mvrsvnfplgvyfqykfikigtdttltweadpnreytvpsdcntspvlsgpwqyppvstsaavaiaatsttaqasavvtsqptycpntpttrncwadgfdintdfdnhwpttgrtvsytfeitnttlApDgvSRPvfaingqyPGPtIYADWgDMISVtvinklqdngttIHFHGVrQWHSNtqdgvpGLTECpIAPGTTRTyTfqatQfGTSwYHSHFSCQYGdGILGPiviqGpAtANYDIDVGpLpitdwyyptvnyhasitehtngsppvadnalINgTMTSSsggayaltklttgkrhRLrLINTSVDNHFMVSLdghqievitaDFvpIVPYNtTWimlgigqRYDVIITADQAPgnywfraevqdgcganannkniksmfaypghenetpistgsgytgrctdetqlvphwnsyvppkagfaeptffgtalnqstaadgtltlywqvngsslniqwdkptlqyvqesntsypsganvvetategqwtywviqevpgtpynvnvphplhLHGHDFFilGtgtdtwdakfsqnlnynnpprrdtamlpsngwivlawqtdnpgawLFhCHiAwhsdeglgvqfleaadqmqsvnpigtefdqqcqawhdfypvhasylkhdsgv
```

**Accession** This is the protein ID and any other data that was provided on the description line for the protein in its source FASTA file. If the protein is retrieved from the genomic assembly, a gene name may have already been assigned to it, and this gene name may appear in CAZy for another protein. For example, the GenBank protein [XP_001545605.1](https://www.ncbi.nlm.nih.gov/protein/XP_001545605.1) is annotated as a hypothetical protein. CUPP has labeled it as BCIN_01g00260; in [UniProt](https://www.uniprot.org/uniprot/A0A384J3W2) this is the name of the gene, and in CAZy it is the name of an [AA4 CAZyme](http://www.cazy.org/AA4_all.html) with the GenBank accession number ATZ45204.1.

**Raw function** This is produced when "consider[ing] all EC functions of a CUPP equally", as described in a comment in the Python script. It is written as the predicted CAZy family, then ':', then the predicted EC number.

**Best function** This is the "most occuring EC function of the CUPP group", as described in a comment in the Python script. It is inferred from the code that the 'CUPP group' refers to the CAZy family k-mer cluster for which the query protein scored most highly, and to which it has thus been assigned. This will be taken as the predicted CAZy family for the query protein. In the FASTA file, this is written as the predicted CAZy family, then ':', then the predicted EC number.

**Additional information** This includes several pieces of information, each of which is only included if it was calculated by CUPP. Therefore, it can be difficult to determine which specific pieces of information have been provided. The information is presented in the following order, as defined in the comments of the Python script:
- "Ratio of maximum frequencies"
- "Ratio of maximum frequencies by coverage"
- "Centre for mass for domain"
- "Domain range" -- this is identifiable by the presence of '..' between two numbers (for example `124..357`) and could potentially be useful when looking for regions of positive selection down the road
- "Number of important positions in domain" -- the meaning of this is very ambiguous

**Subfam** This is the predicted CAZy subfamily for the query protein

**Seq** This is the query protein sequence -- have not had a chance to check why some letters are lower case and some are upper case
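
The field layout described above can be turned into a small parser. The following is a minimal sketch, not CUPP's own code: `parse_cupp_fasta_header` is a hypothetical helper, and it assumes every description line ends with exactly five '|'-separated fields after the accession, as in the example. The meaning of the `&` field is not documented here, so it is simply ignored.

```python
import re

# A domain range looks like "124..357": two integers joined by two dots.
DOMAIN_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def parse_cupp_fasta_header(header):
    """Split one CUPP FASTA description line into named fields (hypothetical helper)."""
    # Split from the right so that any '|' inside the description itself survives.
    description, _marker, raw_function, best_function, additional_info, subfam = (
        header.lstrip(">").rsplit("|", 5)
    )
    match = DOMAIN_RANGE.search(additional_info)
    return {
        "description": description,
        "raw_function": raw_function or None,
        "best_function": best_function or None,
        "additional_info": additional_info or None,
        "domain_range": (int(match.group(1)), int(match.group(2))) if match else None,
        "subfam": subfam or None,
    }


header = ">QIX00022.1 AMS68_005539|&|AA1:1.10.3.2|AA1:1.10.3.2|AA1:6.1(20.3,124..357)|AA1_3"
fields = parse_cupp_fasta_header(header)
print(fields["description"], fields["domain_range"], fields["subfam"])
```

Splitting from the right keeps a description containing '|' intact, but a line with fewer than five separators raises a `ValueError` and would need separate handling.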


#### Output log file
The log file presents the same data as the FASTA file, but the fields are separated by tabs ('\t') instead of '|'.

For example:
QIX00022.1 AMS68_005539	AA1	AA1:6.1	20.3	11.61	256	124..357	115	AA1:1.10.3.2	AA1:1.10.3.2	AA1_3

The data is presented in a slightly different order as well:
- Predicted CAZy family (equivalent to the FASTA file best function)
- CUPP (k-mer cluster) group for which the query protein scored highest, and the associated score
- Additional information
- Raw_function (same as the FASTA file (cazy_fam:ec#))
- Best_function (same as the FASTA file (cazy_fam:ec#))
- Subfam
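
As a rough sketch of how the log columns could be given names, the snippet below zips a tab-split line against a list of labels. The labels for the numeric columns are assumptions inferred from the order of the "additional information" items listed earlier, not names defined by CUPP, and `parse_cupp_log_line` is a hypothetical helper.

```python
# Column labels are assumptions: only the position of each value is known for sure.
CUPP_LOG_FIELDS = (
    "description",
    "cazy_family",
    "cupp_group",
    "freq_ratio",              # assumed: "Ratio of maximum frequencies"
    "freq_ratio_by_coverage",  # assumed: "Ratio of maximum frequencies by coverage"
    "domain_centre",           # assumed: "Centre for mass for domain"
    "domain_range",
    "important_positions",     # assumed: "Number of important positions in domain"
    "raw_function",
    "best_function",
    "subfam",
)


def parse_cupp_log_line(line):
    """Map one tab-separated log line onto the assumed field names."""
    values = line.split("\t")
    if len(values) != len(CUPP_LOG_FIELDS):
        raise ValueError(f"expected {len(CUPP_LOG_FIELDS)} fields, found {len(values)}")
    # Empty strings mark data that CUPP did not calculate.
    return {name: (value or None) for name, value in zip(CUPP_LOG_FIELDS, values)}


example = (
    "QIX00022.1 AMS68_005539\tAA1\tAA1:6.1\t20.3\t11.61\t256\t124..357\t115"
    "\tAA1:1.10.3.2\tAA1:1.10.3.2\tAA1_3"
)
record = parse_cupp_log_line(example)
print(record["cazy_family"], record["domain_range"], record["subfam"])
```

Raising on an unexpected field count makes it obvious if CUPP ever omits a column entirely rather than leaving it empty.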


In [3]:
# reading the fasta file
with open("gca_143535_cupp.fasta", "r") as fh:
    fasta_file = fh.read().splitlines()
for line in fasta_file[:10]:
    print(line)

# data is separated by '|'

>XP_001545605.1 BCIN_01g00260|&|AA4:1.1.3.38|AA4:1.1.3.38|AA4:1.1(13.9,122..568)|
maasivttdtplalrvpnaattpaandplalppgtssekfaefishaeqttgaenviiirkkeeltkelytdpskahdmyhlfdkdyfvvsatiaprnveevqaivrlcndyeiplwpfsigrnvGYGgAAPRVPGSigldmgknmnkvlevnsegayallEPGvTFFSLydylkekklddqlwidtpdlgggsvvGNTIERGvGYTPYGDHFMMHCGLeVVlPTGELIRTGMGALPDPTQptveggsldqqpgnkcwqlfpygfgpYnDGIfSQSNlGIVTKMGIWLMpNPGGyQpylitfpkdtdlpkivdiirplrlqmviqnvpsirhilldaavmgtkksyydvdrplneeeldaiakelnlgrwnfygAlYgPKPvrdvlwqvvkdafgtiegakfflpedikekcvlhiraktlqgiptidelswvdwlpngahlffspiskisgedaskqyaltqkmtleagfdfigtftigmremhhiiclvfdrededqkrrahklireliqvcadngwgeYrTHIALMdqiaetygwndnaqmklnekiknaLDpKGILaPGKNGVWpAsYdrsawrldansertrtrp
>XP_024545922.1 BCIN_01g00340|&|GT2:2.4.1.117|GT2:2.4.1.117|GT2:23.1(20.2,116..300)|
mdgllelpvqilgglwevvrdtpvhvlgviligllglglgliyallllvapvprppypsektyitttpsgttetkpltcwydswvahreasvsksdscpantirtgaiepatlemslvvPaYNEEERLIgMLeealsfldttygrvargtgtgtgyeillvNDgSRDRTVEIAlDFSrknglhdvlrivtleenRGkGGAVTHGMRHVRGEYAVfADADGASRFADLGkLvkgvrevvdeeg

In [4]:
# read the raw text in the fasta file
for line in fasta_file[:10]:
    print(repr(line))

'>XP_001545605.1 BCIN_01g00260|&|AA4:1.1.3.38|AA4:1.1.3.38|AA4:1.1(13.9,122..568)|'
'maasivttdtplalrvpnaattpaandplalppgtssekfaefishaeqttgaenviiirkkeeltkelytdpskahdmyhlfdkdyfvvsatiaprnveevqaivrlcndyeiplwpfsigrnvGYGgAAPRVPGSigldmgknmnkvlevnsegayallEPGvTFFSLydylkekklddqlwidtpdlgggsvvGNTIERGvGYTPYGDHFMMHCGLeVVlPTGELIRTGMGALPDPTQptveggsldqqpgnkcwqlfpygfgpYnDGIfSQSNlGIVTKMGIWLMpNPGGyQpylitfpkdtdlpkivdiirplrlqmviqnvpsirhilldaavmgtkksyydvdrplneeeldaiakelnlgrwnfygAlYgPKPvrdvlwqvvkdafgtiegakfflpedikekcvlhiraktlqgiptidelswvdwlpngahlffspiskisgedaskqyaltqkmtleagfdfigtftigmremhhiiclvfdrededqkrrahklireliqvcadngwgeYrTHIALMdqiaetygwndnaqmklnekiknaLDpKGILaPGKNGVWpAsYdrsawrldansertrtrp'
'>XP_024545922.1 BCIN_01g00340|&|GT2:2.4.1.117|GT2:2.4.1.117|GT2:23.1(20.2,116..300)|'
'mdgllelpvqilgglwevvrdtpvhvlgviligllglglgliyallllvapvprppypsektyitttpsgttetkpltcwydswvahreasvsksdscpantirtgaiepatlemslvvPaYNEEERLIgMLeealsfldttygrvargtgtgtgyeillvNDgSRDRTVEIAlDFSrknglhdvlrivtleenRGkGGAVTHGMRHVRGEYAVfADADGASRFADLGkLvkgvr

In [5]:
# read the log file
with open("gca_143535_cupp.fasta.log", "r") as lfh:
    log_file = lfh.read().splitlines()

for line in log_file[:10]:
    print(line)

XP_001545605.1 BCIN_01g00260	AA4	AA4:1.1	13.9	7.39	268	122..568	107	AA4:1.1.3.38	AA4:1.1.3.38	
XP_024545922.1 BCIN_01g00340	GT2	GT2:23.1	20.2	7.74	211	116..300	77	GT2:2.4.1.117	GT2:2.4.1.117	
XP_001547235.1 BCIN_01g00640	AA3	AA3:3.1	64.0	144.16	274	26..553	453			AA3_1
XP_001547254.1 BCIN_01g00800	AA1	AA1:6.1	10.5	5.13	329	202..455	98	AA1:1.10.3.2	AA1:1.10.3.2	AA1_3
XP_024546029.1 BCIN_01g01760	GH18	GH18:12.1	11.4	2.38	182	116..258	42			
XP_024546029.1 BCIN_01g01760	PL7	PL7:0.1	1.1	0.11	1023	989..1040	20			PL7:2
XP_001548518.1 BCIN_01g02310	GT39	GT39:2.1	71.3	44.37	114	62..264	125	GT39:2.4.1.109	GT39:2.4.1.109	
XP_001548545.1 BCIN_01g02520	GT2	GT2:11.1	33.2	16.05	415	213..452	97	GT2:2.4.1.16	GT2:2.4.1.16	
XP_001549934.1 BCIN_01g02970	AA11	AA11:1.1	6.6	1.08	104	55..160	33	AA11:1.*.*.*	AA11:1.*.*.*	
XP_024546135.1 BCIN_01g03210	GH71	GH71:1.2	7.0	2.08	181	71..381	60	GH71:3.2.1.59	GH71:3.2.1.59	


In [6]:
# read the raw text in the log file
for line in log_file[:10]:
    print(repr(line))
# data is separated by '\t'

'XP_001545605.1 BCIN_01g00260\tAA4\tAA4:1.1\t13.9\t7.39\t268\t122..568\t107\tAA4:1.1.3.38\tAA4:1.1.3.38\t'
'XP_024545922.1 BCIN_01g00340\tGT2\tGT2:23.1\t20.2\t7.74\t211\t116..300\t77\tGT2:2.4.1.117\tGT2:2.4.1.117\t'
'XP_001547235.1 BCIN_01g00640\tAA3\tAA3:3.1\t64.0\t144.16\t274\t26..553\t453\t\t\tAA3_1'
'XP_001547254.1 BCIN_01g00800\tAA1\tAA1:6.1\t10.5\t5.13\t329\t202..455\t98\tAA1:1.10.3.2\tAA1:1.10.3.2\tAA1_3'
'XP_024546029.1 BCIN_01g01760\tGH18\tGH18:12.1\t11.4\t2.38\t182\t116..258\t42\t\t\t'
'XP_024546029.1 BCIN_01g01760\tPL7\tPL7:0.1\t1.1\t0.11\t1023\t989..1040\t20\t\t\tPL7:2'
'XP_001548518.1 BCIN_01g02310\tGT39\tGT39:2.1\t71.3\t44.37\t114\t62..264\t125\tGT39:2.4.1.109\tGT39:2.4.1.109\t'
'XP_001548545.1 BCIN_01g02520\tGT2\tGT2:11.1\t33.2\t16.05\t415\t213..452\t97\tGT2:2.4.1.16\tGT2:2.4.1.16\t'
'XP_001549934.1 BCIN_01g02970\tAA11\tAA11:1.1\t6.6\t1.08\t104\t55..160\t33\tAA11:1.*.*.*\tAA11:1.*.*.*\t'
'XP_024546135.1 BCIN_01g03210\tGH71\tGH71:1.2\t7.0\t2.08\t181\t71..381\t60\tGH71:3.2

A user or database may choose to put a '|' within the name of their protein or species. Therefore, using '|' to identify separations between individual items of data may be a brittle approach.

It is very unlikely a database or user would use `\t` within the name of their protein, especially because CUPP will be reading FASTA files written by `pyrewton`, which does not put `\t` within names. Additionally, a note can be made in the README telling users who create their own FASTA files not to include `\t` in the description line. The same could be said for `|`, but the issue is dealing with databases, which cannot be told how to present their data.

Each line in the log file contains a single prediction. A protein can appear on more than one line: in the output above, XP_024546029.1 is listed once for GH18 and once for PL7. The first item is the description retrieved from the FASTA file, which consists of the protein accession number (retrieved from UniProt or the genomic assembly) followed by the gene name.
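
Since the description holds the accession followed by the gene name, the two can be separated on the first space. This is a minimal sketch with a hypothetical helper name, assuming the accession itself never contains a space:

```python
def split_description(description):
    """Return (accession, gene_name) from a description such as 'XP_001545605.1 BCIN_01g00260'."""
    # Drop a leading '>' and surrounding whitespace, in case a raw FASTA header is passed in.
    cleaned = description.lstrip(">").strip()
    accession, _, gene_name = cleaned.partition(" ")
    return accession, (gene_name or None)


print(split_description("XP_001545605.1 BCIN_01g00260"))
```

Keeping the accession in its own column would make it easier to compare predictions for the same protein across tools.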

In [7]:
# first split the line into the individual pieces of data
for line in log_file[:4]:
    prediction_output = line.split("\t")
    print(prediction_output)

['XP_001545605.1 BCIN_01g00260', 'AA4', 'AA4:1.1', '13.9', '7.39', '268', '122..568', '107', 'AA4:1.1.3.38', 'AA4:1.1.3.38', '']
['XP_024545922.1 BCIN_01g00340', 'GT2', 'GT2:23.1', '20.2', '7.74', '211', '116..300', '77', 'GT2:2.4.1.117', 'GT2:2.4.1.117', '']
['XP_001547235.1 BCIN_01g00640', 'AA3', 'AA3:3.1', '64.0', '144.16', '274', '26..553', '453', '', '', 'AA3_1']
['XP_001547254.1 BCIN_01g00800', 'AA1', 'AA1:6.1', '10.5', '5.13', '329', '202..455', '98', 'AA1:1.10.3.2', 'AA1:1.10.3.2', 'AA1_3']


In [8]:
for line in log_file[:4]:
    prediction_output = line.split("\t")
    # the domain range is the only field containing '..' (e.g. '122..568')
    domain_range = None
    for item in prediction_output:
        if item.find("..") != -1:
            domain_range = item
            break
    prediction = {
        "protein_accession": prediction_output[0],
        "cazy_family": prediction_output[1],
        "cazy_subfamily": prediction_output[-1],
        "ec_number": prediction_output[-2],
        "domain_range": domain_range,
    }
    print(prediction)


{'protein_accession': 'XP_001545605.1 BCIN_01g00260', 'cazy_family': 'AA4', 'cazy_subfamily': '', 'ec_number': 'AA4:1.1.3.38', 'domain_range': '122..568'}
{'protein_accession': 'XP_024545922.1 BCIN_01g00340', 'cazy_family': 'GT2', 'cazy_subfamily': '', 'ec_number': 'GT2:2.4.1.117', 'domain_range': '116..300'}
{'protein_accession': 'XP_001547235.1 BCIN_01g00640', 'cazy_family': 'AA3', 'cazy_subfamily': 'AA3_1', 'ec_number': '', 'domain_range': '26..553'}
{'protein_accession': 'XP_001547254.1 BCIN_01g00800', 'cazy_family': 'AA1', 'cazy_subfamily': 'AA1_3', 'ec_number': 'AA1:1.10.3.2', 'domain_range': '202..455'}


This generates dictionaries that can easily be added to a Pandas dataframe.

Check the meaning of PL7:2 in the subfamily field. There is no PL7_2 subfamily, so is this a CUPP-suggested subfamily?
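
One way to separate the two kinds of subfamily value is to test them against the usual CAZy subfamily naming (family, underscore, number). The sketch below is only an illustration: `classify_subfam` is a hypothetical helper, and the class prefixes in the pattern are an assumption about which CAZy classes can appear.

```python
import re

# CAZy-style subfamily name, e.g. "AA3_1" or "GH5_5" (class prefixes are assumed).
CAZY_SUBFAMILY = re.compile(r"^(GH|GT|PL|CE|AA|CBM)\d+_\d+$")


def classify_subfam(value):
    """Label a CUPP subfam field as 'missing', 'cazy' or 'other'."""
    if not value:
        return "missing"
    # Several values can be joined with '+', e.g. "GH13:31+GH13:40".
    if all(CAZY_SUBFAMILY.match(part) for part in value.split("+")):
        return "cazy"
    return "other"


for value in ["AA3_1", "PL7:2", "GH13:31+GH13:40", ""]:
    print(repr(value), classify_subfam(value))
```

Values labelled 'other' are the ones whose meaning still needs to be confirmed against the CUPP script.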

In [9]:
# build an empty dataframe
cupp_df = pd.DataFrame(columns=["protein_accession","cazy_family","cazy_subfamily","ec_number","domain_range"])

# add in predictions from log file to dataframe
for line in log_file[:10]:
    prediction_output = line.split("\t")
    
    # retrieve domain range if given
    for item in prediction_output:
        if item.find("..") != -1:
            domain_range = item
            break
        else:
            domain_range = np.nan

    try:
        ec_number = prediction_output[-2].split(":")[1]
    except IndexError:
        ec_number = np.nan

    # build dict to enable easy building of df
    prediction = {
        "protein_accession": [prediction_output[0]],
        "cazy_family": [prediction_output[1]],
        "cazy_subfamily": [prediction_output[-1]],
        "ec_number": [ec_number],
        "domain_range": [domain_range],
    }

    # change empty string to null value
    if len(prediction["cazy_subfamily"][0]) == 0:
        prediction["cazy_subfamily"][0] = np.nan
    # If CUPP suggested a subfamily, i.e. a subfamily not in CAZy, mark this with '*'
    else:
        if prediction["cazy_subfamily"][0].find(":") != -1:
            prediction["cazy_subfamily"][0] = "*" + prediction["cazy_subfamily"][0]
            print(prediction["cazy_subfamily"][0])

    prediction_df = pd.DataFrame(prediction)
    cupp_df = pd.concat([cupp_df, prediction_df])  # DataFrame.append was removed in pandas 2.0

cupp_df.head()

*PL7:2


Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,ec_number,domain_range
0,XP_001545605.1 BCIN_01g00260,AA4,,1.1.3.38,122..568
0,XP_024545922.1 BCIN_01g00340,GT2,,2.4.1.117,116..300
0,XP_001547235.1 BCIN_01g00640,AA3,AA3_1,,26..553
0,XP_001547254.1 BCIN_01g00800,AA1,AA1_3,1.10.3.2,202..455
0,XP_024546029.1 BCIN_01g01760,GH18,,,116..258


As a single function:

In [10]:
def parse_cupp_output(log_file_path):
    """Parse the output log file from CUPP and write the data to a dataframe.

    Retrieves the protein accession/name/identifier, predicted CAZy family, predicted CAZy subfamily,
    predicted EC number and predicted range of the domain within the protein sequence (the index of the
    first and last residues of the domain).

    :param log_file_path: path, path to the output log file

    Return Pandas dataframe containing CUPP output
    """
    
    with open(log_file_path, "r") as lfh:
        log_file = lfh.read().splitlines()
    
    # build an empty dataframe to add prediction outputs to
    cupp_df = pd.DataFrame(columns=["protein_accession","cazy_family","cazy_subfamily","ec_number","domain_range"])

    # add in predictions from log file to dataframe
    for line in log_file:
        prediction_output = line.split("\t")

        # retrieve domain range if given
        for item in prediction_output:
            if item.find("..") != -1:
                domain_range = item
                break
            else:
                domain_range = np.nan
        
        try:
            ec_number = prediction_output[-2].split(":")[1]
        except IndexError:
            ec_number = np.nan

        # build dict to enable easy building of df
        prediction = {
            "protein_accession": [prediction_output[0]],
            "cazy_family": [prediction_output[1]],
            "cazy_subfamily": [prediction_output[-1]],
            "ec_number": [ec_number],
            "domain_range": [domain_range],
        }

        # change empty string to null value
        if len(prediction["cazy_subfamily"][0]) == 0:
            prediction["cazy_subfamily"][0] = np.nan
        # If CUPP suggested a subfamily, i.e. a subfamily not in CAZy, mark this with '*'
        else:
            if prediction["cazy_subfamily"][0].find(":") != -1:
                prediction["cazy_subfamily"][0] = "*" + prediction["cazy_subfamily"][0]
                print(prediction["cazy_subfamily"][0])

        prediction_df = pd.DataFrame(prediction)
        cupp_df = pd.concat([cupp_df, prediction_df])  # DataFrame.append was removed in pandas 2.0

    return cupp_df

parsed_cupp_result = parse_cupp_output("gca_143535_cupp.fasta.log")
parsed_cupp_result


*PL7:2
*AA1:2
*GH13:31+GH13:40
*PL1:4
*AA3:2
*AA1:3
*AA3:2
*AA1:3
*AA1:3
*AA1:3
*GH5:49
*AA3:1
*PL1:4
*AA3:2
*AA3:2
*AA3:2
*PL1:4


Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,ec_number,domain_range
0,XP_001545605.1 BCIN_01g00260,AA4,,1.1.3.38,122..568
0,XP_024545922.1 BCIN_01g00340,GT2,,2.4.1.117,116..300
0,XP_001547235.1 BCIN_01g00640,AA3,AA3_1,,26..553
0,XP_001547254.1 BCIN_01g00800,AA1,AA1_3,1.10.3.2,202..455
0,XP_024546029.1 BCIN_01g01760,GH18,,,116..258
...,...,...,...,...,...
0,XP_001547801.1 BCIN_16g04060,GH16,,3.2.1.6-GH16,74..281
0,XP_024553953.1 BCIN_16g04060,GH16,,,74..134
0,XP_024553962.1 BCIN_16g04160,GH47,,3.2.1.113,129..565
0,XP_024553961.1 BCIN_16g04160,GH47,,3.2.1.113,134..570


## Parsing the output from eCAMI

### Understanding the output from eCAMI

eCAMI produces a single output file, a plain text file.

Each protein is given two lines, and a new protein is indicated by the first line starting with '>'.

For example:
```
>XP_001545605.1 BCIN_01g00260 	AA4:3	AA4:24|1.1.3.38:1
EIPLWPFS(113),IPLWPFSI(114),PLWPFSIG(115),LWPFSIGR(116),WPFSIGRN(117),PFSIGRNV(118),FSIGRNVG(119),SIGRNVGY(120),IGRNVGYG(121),GRNVGYGG(122),RNVGYGGA(123),NVGYGGAA(124),VGYGGAAP(125),GYGGAAPR(126),YGGAAPRV(127),GGAAPRVP(128),GAAPRVPG(129),AAPRVPGS(130),APRVPGSI(131),PRVPGSIG(132),RVPGSIGL(133),VPGSIGLD(134),DTPDLGGG(186),TPDLGGGS(187),PDLGGGSV(188),ERGVGYTP(201),RGVGYTPY(202),GVGYTPYG(203),VGYTPYGD(204),GYTPYGDH(205),YTPYGDHF(206),TPYGDHFM(207),PYGDHFMM(208),YGDHFMMH(209),GDHFMMHC(210),DHFMMHCG(211),GLEVVLPT(218),LEVVLPTG(219),EVVLPTGE(220),IRTGMGAL(229),RTGMGALP(230),TGMGALPD(231),GMGALPDP(232),LFPYGFGP(260),FPYGFGPY(261),YGFGPYND(263),GFGPYNDG(264),FGPYNDGI(265),GPYNDGIF(266),DGIFSQSN(270),FSQSNLGI(273),QSNLGIVT(275),SNLGIVTK(276),WLMPNPGG(287),LMPNPGGY(288),MPNPGGYQ(289),IRHILLDA(330),RHILLDAA(331),HILLDAAV(332),LLDAAVMG(334),LNLGRWNF(365),NLGRWNFY(366),LGRWNFYG(367),GRWNFYGA(368),RWNFYGAL(369),WNFYGALY(370),NFYGALYG(371),FYGALYGP(372),DWLPNGAH(434),WLPNGAHL(435),LPNGAHLF(436),PNGAHLFF(437),NGAHLFFS(438),GAHLFFSP(439),AHLFFSPI(440),EAGFDFIG(468),AGFDFIGT(469),GFDFIGTF(470),GMREMHHI(480),NGWGEYRT(517),GWGEYRTH(518),IKNALDPK(550),KNALDPKG(551),NALDPKGI(552),ALDPKGIL(553),LDPKGILA(554),DPKGILAP(555),PKGILAPG(556),KGILAPGK(557),GILAPGKN(558),ILAPGKNG(559),
```

The first line contains:
- Protein accession, taken straight from the description/identifier line of the input FASTA file
- Predicted CAZy family : group number (believed to be the k-mer cluster/group number)
- "Subname of group": "subfam_name_count"

Each of these items appears to be separated by a tab ('\t'). This needs to be checked.

Only one protein accession and one predicted CAZy family are given, but multiple subgroups can be named, each separated by '|'. These appear to be additional domains/modules eCAMI has found, such as other catalytic domains and CBMs.

The predicted CAZy family is the 'best' or highest scoring result.

The second line contains each of the k-mers that were identified in the query sequence, each followed by a number in parentheses. This number is taken here to be the number of the parent k-mer cluster, although the consecutive values for overlapping k-mers in the example above (113, 114, 115, ...) suggest it may instead be the position of the k-mer in the query sequence. This needs to be checked.
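
The k-mer line can be broken into pairs with a regular expression. This is a small sketch with a hypothetical helper name. It makes no assumption about what the number in parentheses means and only converts it to an integer.

```python
import re

# One entry looks like "EIPLWPFS(113)": upper-case residues, then a number in parentheses.
KMER_ENTRY = re.compile(r"([A-Z]+)\((\d+)\)")


def parse_kmer_line(line):
    """Return a list of (k-mer, number) tuples from the second line of an eCAMI record."""
    return [(kmer, int(number)) for kmer, number in KMER_ENTRY.findall(line)]


kmers = parse_kmer_line("EIPLWPFS(113),IPLWPFSI(114),PLWPFSIG(115),")
print(kmers)
```

Using `findall` means the trailing comma at the end of the line needs no special handling.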

Sometimes a large number of other domains is found. For example:
```
>XP_024550696.1 BCIN_09g01110 	CBM1:16	CBM1:138|GH7:117|3.2.1.176:32|GH12:3|3.2.1.132:2|GH45:2|CE16:1|AA9:1|GH5_5:1|CE1:1|3.1.1.-:1|3.2.1.4:1|GH16:1|AA8:1|PL4_1:1|GH30_7:1|CBM18:1|GH43_6:1
```
This seems to occur frequently when the 'best' result is a CBM domain, which would appear to be an issue with identifying CBM domains. It might be that a CBM domain appears to have multiple different CAZy class functions because binding of the CBM's associated substrate is shared by many other catalytic domains.

To retrieve the predicted subfamily, the final section of other domains needs to be retrieved and filtered for the item that matches the predicted CAZy family followed by "\_" and a number.

Sometimes an EC number is also predicted. This appears in the final section of the first line, but its position changes. Therefore, to retrieve it, the final section will need to be iterated over. The EC number is followed by ':' and the group number.
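
The filtering described above might be sketched as follows. `parse_ecami_header` is a hypothetical helper, not part of eCAMI. It assumes the first line always has exactly three tab-separated fields, treats an item as the subfamily only when it starts with the predicted family plus "_", and keeps every EC number because more than one can be listed.

```python
import re

# EC numbers such as "3.2.1.176" or the partial "3.1.1.-".
EC_NUMBER = re.compile(r"^\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)$")


def parse_ecami_header(line):
    """Parse the first line of an eCAMI record into a dictionary."""
    description, family_group, others = line.lstrip(">").split("\t")
    family = family_group.split(":")[0]
    subfamily = None
    ec_numbers = []
    additional_domains = []
    for item in others.split("|"):
        name = item.rsplit(":", 1)[0]  # drop the number after the final ':'
        if EC_NUMBER.match(name):
            ec_numbers.append(name)
        elif name.startswith(family + "_"):
            subfamily = name  # subfamily of the predicted family only
        elif name != family and name not in additional_domains:
            additional_domains.append(name)
    return {
        "protein_accession": description.strip(),
        "cazy_family": family,
        "cazy_subfamily": subfamily,
        "ec_numbers": ec_numbers,
        "additional_domains": additional_domains,
    }


record = parse_ecami_header(">XP_001547235.1 BCIN_01g00640 \tAA3:9\tAA3_1:9")
print(record)
```

Matching the subfamily against the predicted family keeps subfamilies of other families, such as `GH5_5` on a CBM1 hit, in the list of additional domains rather than reporting them as the protein's own subfamily.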

**Data to collect:**
- Protein accession
- Predicted CAZy family
- Predicted CAZy subfamily
- Predicted EC number
- List of additional domains -- it will be good to see if eCAMI can pick up multi-domain proteins and the individual functions of the enzyme


In [11]:
# Read the output file from eCAMI
with open("143535_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

for line in ecami_file[:4]:
    print(line)

>XP_001545605.1 BCIN_01g00260 	AA4:3	AA4:24|1.1.3.38:1
EIPLWPFS(113),IPLWPFSI(114),PLWPFSIG(115),LWPFSIGR(116),WPFSIGRN(117),PFSIGRNV(118),FSIGRNVG(119),SIGRNVGY(120),IGRNVGYG(121),GRNVGYGG(122),RNVGYGGA(123),NVGYGGAA(124),VGYGGAAP(125),GYGGAAPR(126),YGGAAPRV(127),GGAAPRVP(128),GAAPRVPG(129),AAPRVPGS(130),APRVPGSI(131),PRVPGSIG(132),RVPGSIGL(133),VPGSIGLD(134),DTPDLGGG(186),TPDLGGGS(187),PDLGGGSV(188),ERGVGYTP(201),RGVGYTPY(202),GVGYTPYG(203),VGYTPYGD(204),GYTPYGDH(205),YTPYGDHF(206),TPYGDHFM(207),PYGDHFMM(208),YGDHFMMH(209),GDHFMMHC(210),DHFMMHCG(211),GLEVVLPT(218),LEVVLPTG(219),EVVLPTGE(220),IRTGMGAL(229),RTGMGALP(230),TGMGALPD(231),GMGALPDP(232),LFPYGFGP(260),FPYGFGPY(261),YGFGPYND(263),GFGPYNDG(264),FGPYNDGI(265),GPYNDGIF(266),DGIFSQSN(270),FSQSNLGI(273),QSNLGIVT(275),SNLGIVTK(276),WLMPNPGG(287),LMPNPGGY(288),MPNPGGYQ(289),IRHILLDA(330),RHILLDAA(331),HILLDAAV(332),LLDAAVMG(334),LNLGRWNF(365),NLGRWNFY(366),LGRWNFYG(367),GRWNFYGA(368),RWNFYGAL(369),WNFYGALY(370),NFYGALYG(371),FYGALYG

In [12]:
# Check how the data is separated
# Reading the code shows it could be tabs
for line in ecami_file[:4]:
    print(repr(line))

'>XP_001545605.1 BCIN_01g00260 \tAA4:3\tAA4:24|1.1.3.38:1'
'EIPLWPFS(113),IPLWPFSI(114),PLWPFSIG(115),LWPFSIGR(116),WPFSIGRN(117),PFSIGRNV(118),FSIGRNVG(119),SIGRNVGY(120),IGRNVGYG(121),GRNVGYGG(122),RNVGYGGA(123),NVGYGGAA(124),VGYGGAAP(125),GYGGAAPR(126),YGGAAPRV(127),GGAAPRVP(128),GAAPRVPG(129),AAPRVPGS(130),APRVPGSI(131),PRVPGSIG(132),RVPGSIGL(133),VPGSIGLD(134),DTPDLGGG(186),TPDLGGGS(187),PDLGGGSV(188),ERGVGYTP(201),RGVGYTPY(202),GVGYTPYG(203),VGYTPYGD(204),GYTPYGDH(205),YTPYGDHF(206),TPYGDHFM(207),PYGDHFMM(208),YGDHFMMH(209),GDHFMMHC(210),DHFMMHCG(211),GLEVVLPT(218),LEVVLPTG(219),EVVLPTGE(220),IRTGMGAL(229),RTGMGALP(230),TGMGALPD(231),GMGALPDP(232),LFPYGFGP(260),FPYGFGPY(261),YGFGPYND(263),GFGPYNDG(264),FGPYNDGI(265),GPYNDGIF(266),DGIFSQSN(270),FSQSNLGI(273),QSNLGIVT(275),SNLGIVTK(276),WLMPNPGG(287),LMPNPGGY(288),MPNPGGYQ(289),IRHILLDA(330),RHILLDAA(331),HILLDAAV(332),LLDAAVMG(334),LNLGRWNF(365),NLGRWNFY(366),LGRWNFYG(367),GRWNFYGA(368),RWNFYGAL(369),WNFYGALY(370),NFYGALYG(371),FY

In [13]:
# separate out the data for each protein
for line in ecami_file[:6]:
    if line.startswith(">"):  # identifies new protein
        prediction_output = line.split("\t")
        print(prediction_output)

['>XP_001545605.1 BCIN_01g00260 ', 'AA4:3', 'AA4:24|1.1.3.38:1']
['>XP_024545922.1 BCIN_01g00340 ', 'GT2:1526', 'GT2:37']
['>XP_001547235.1 BCIN_01g00640 ', 'AA3:9', 'AA3_1:9']


In [14]:
# parse the outputs into a format suitable for the final dataframe
for line in ecami_file[:10]:
    if line.startswith(">"):  # identifies new protein
        prediction_output = line.split("\t")
        cazy_fam = prediction_output[1].split(":")[0]
        
        cazy_subfam = np.nan
        ec_number = np.nan
        
        # check for additional domains
        other_domains = prediction_output[2].split("|")
        additional_domains = []
        for domain in other_domains:
            domain = domain.split(":")[0]  # drop the eCAMI group number
            # check if a subfamily is predicted for the main CAZy family
            if len(domain.split("_")) != 1:
                cazy_subfam = domain
            # check if an EC number is predicted
            elif domain.find(".") != -1:
                ec_number = domain
            else:
                additional_domains.append(domain)
        
        # check parsing is correct
        print("additional_domains=", additional_domains)
        print("subfam=", cazy_subfam)
        print("ecnum=", ec_number)
        

additional_domains= ['AA4']
subfam= nan
ecnum= 1.1.3.38
additional_domains= ['GT2']
subfam= nan
ecnum= nan
additional_domains= []
subfam= AA3_1
ecnum= nan
additional_domains= ['CBM18']
subfam= AA1_3
ecnum= 1.10.3.2
additional_domains= ['GH18', 'CBM50', 'CBM18']
subfam= nan
ecnum= nan


In [15]:
        # build dict to enable easy building of df
        prediction = {
            "protein_accession": [prediction_output[0]],
            "cazy_family": [prediction_output[1].split(":")[0]],
            "cazy_subfamily": [cazy_subfam ],
            "ec_number": [ec_number],
            "additional_domains": [additional_domains],
        }
        
        print(prediction)

{'protein_accession': ['>XP_024546029.1 BCIN_01g01760 '], 'cazy_family': ['GH18'], 'cazy_subfamily': [nan], 'ec_number': [nan], 'additional_domains': [['GH18', 'CBM50', 'CBM18']]}


As a single function including building a dataframe and adding the predictions to it:

In [16]:
def parse_ecami_output(txt_file_path):
    """Parse the output text file from eCAMI and write the data to a dataframe.

    Retrieves the protein accession/name/identifier, predicted CAZy family, predicted CAZy subfamily,
    predicted EC number and additional/other domains predicted to also be within the protein sequence,
    indicating prediction of a multi-module enzyme.

    :param txt_file_path: path, path to the output text file

    Return Pandas dataframe containing eCAMI output
    """
    with open(txt_file_path, "r") as fh:
        ecami_file = fh.read().splitlines()
    
    # build an empty dataframe to add prediction outputs to
    ecami_df = pd.DataFrame(columns=["protein_accession","cazy_family","cazy_subfamily","ec_number","domain_range"])
        
    # parse the outputs into a format suitable for the final dataframe
    for line in ecami_file:
        if line.startswith(">"):  # identifies new protein
            prediction_output = line.split("\t")
            cazy_fam = prediction_output[1].split(":")[0]

            cazy_subfam = np.nan
            ec_number = np.nan

            # check for additional domains
            other_domains = prediction_output[2].split("|")
            additional_domains = []
            for domain in other_domains:
                domain = domain.split(":")[0]  # drop the eCAMI group number
                # check if a subfamily is predicted for the main CAZy family
                if len(domain.split("_")) != 1:
                    cazy_subfam = domain
                # check if an EC number is predicted
                elif domain.find(".") != -1:
                    ec_number = domain
                else:
                    additional_domains.append(domain)
            
            if len(additional_domains) == 0:
                additional_domains = np.nan

            # build dict to enable easy building of df
            prediction = {
                "protein_accession": [prediction_output[0]],
                "cazy_family": [prediction_output[1].split(":")[0]],
                "cazy_subfamily": [cazy_subfam ],
                "ec_number": [ec_number],
                "additional_domains": [additional_domains],
            }
            
            prediction_df = pd.DataFrame(prediction)
            ecami_df = pd.concat([ecami_df, prediction_df])  # DataFrame.append was removed in pandas 2.0
    
    return ecami_df

parsed_ecami_output = parse_ecami_output("143535_output.txt")
parsed_ecami_output


Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,ec_number,domain_range,additional_domains
0,>XP_001545605.1 BCIN_01g00260,AA4,,1.1.3.38,,[AA4]
0,>XP_024545922.1 BCIN_01g00340,GT2,,,,[GT2]
0,>XP_001547235.1 BCIN_01g00640,AA3,AA3_1,,,
0,>XP_001547254.1 BCIN_01g00800,AA1,AA1_3,1.10.3.2,,[CBM18]
0,>XP_024546029.1 BCIN_01g01760,GH18,,,,"[GH18, CBM50, CBM18]"
...,...,...,...,...,...,...
0,>XP_024553954.1 BCIN_16g04060,GH16,,,,[GH16]
0,>XP_024553952.1 BCIN_16g04060,GH16,,,,[GH16]
0,>XP_024553962.1 BCIN_16g04160,GH47,,3.2.1.113,,[GH47]
0,>XP_024553961.1 BCIN_16g04160,GH47,,3.2.1.113,,[GH47]


## Parse output from dbCAN

dbCAN produces a directory of output files. This is because dbCAN includes the output from three prediction tools:
- DIAMOND
- HMMER
- Hotpep

Each prediction tool has its own specific output, which includes additional predictions beyond the predicted CAZy family. A summary output file is also generated, which summarises the output from each of the tools.

A summary dataframe, containing the consensus result of the three prediction tools, and a dataframe per prediction tool will be produced for dbCAN.

### The data in the summary dbCAN output file

This data is kept in the `overview.txt` file, and contains:
- Protein accession ('Gene ID')
- Domains predicted by HMMER, separated by '+' and including domain ranges (the index of the first and last residue in the protein sequence)
- Domains predicted by Hotpep, separated by '+' and including, in parentheses, the k-mer cluster/group to which the CAZy family belongs
- CAZy family predicted by DIAMOND
- The number of tools which predicted that the query protein sequence was a CAZyme

If no prediction is made by a given tool, this is indicated by '-'.
The data appears to be separated by `\t`, but this needs to be checked against the script and when parsing the file.

Each protein is on a separate line.
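
A line of `overview.txt` could be parsed along these lines. This is a sketch under the assumptions stated above (tab-separated columns, '+' between domains, '-' for no prediction), and `parse_overview_line` is a hypothetical helper. The header line must be skipped by the caller.

```python
import re

# HMMER entries look like "GH53(20-346)", Hotpep entries like "GH53(7)".
HMMER_DOMAIN = re.compile(r"^(?P<name>.+)\((?P<start>\d+)-(?P<end>\d+)\)$")
HOTPEP_DOMAIN = re.compile(r"^(?P<name>.+)\((?P<group>\d+)\)$")


def split_domains(cell):
    """Split a '+'-separated cell, treating '-' as no prediction."""
    return [] if cell == "-" else cell.split("+")


def parse_overview_line(line):
    """Parse one data line of overview.txt into a dictionary."""
    accession, hmmer, hotpep, diamond, n_tools = line.split("\t")
    hmmer_domains = []
    for domain in split_domains(hmmer):
        match = HMMER_DOMAIN.match(domain)
        if match:
            hmmer_domains.append(
                (match.group("name"), int(match.group("start")), int(match.group("end")))
            )
    hotpep_domains = []
    for domain in split_domains(hotpep):
        match = HOTPEP_DOMAIN.match(domain)
        if match:
            hotpep_domains.append((match.group("name"), int(match.group("group"))))
    return {
        "protein_accession": accession,
        "hmmer": hmmer_domains,
        "hotpep": hotpep_domains,
        "diamond": split_domains(diamond),
        "number_of_tools": int(n_tools),
    }


parsed = parse_overview_line("XP_001545514.1\tGT2_Chitin_synth_2(1213-1720)\tGT2(161)\tGT2\t3")
print(parsed)
```

Entries that do not match the expected pattern are silently dropped here. Logging them instead would help reveal formats not seen in the first few lines.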

### The data in the DIAMOND output file

Data is stored in the file `diamond.out` and appears to be separated by `\t`, but this is a .out file so this will need to be checked. The file may be readable as a dataframe.

Contains:
- Protein accession ("Gene ID")
- Predicted CAZy family ("CAZy ID", which also carries the accession of the matched CAZy protein)
- Percentage identity between the query sequence and the matched CAZy sequence
- Length -- length of protein? (the column names follow the BLAST-style tabular format, in which this would be the alignment length; to be confirmed)
- Mismatches -- number of mismatched residues?
- Gap open
- Gene start and end
- CAZy start and end, but these numbers can be floats, so it is not fully clear what they represent
- E value and bit score of the alignment

Data to retrieve is the protein accession and predicted CAZy family. Need to check what is meant by the CAZy start and end and whether this will be helpful (by looking at the Python script). But most of the data is easier to retrieve from the txt file.

Multiple predicted domains are less common than for the other two tools (HMMER and Hotpep), and are separated by '+'.

Each protein is on a separate line.

### The data in the HMMER output file

Data appears to be separated by `\t`, and is in the file `hmmer.out`. Again, this is a .out file and may be readable as a dataframe.

Contains the:
- HMM profile which matched
- "profile length"?
- Protein accession ("Gene ID")
- Gene length
- E value of the result
- The first residue that matched the CAZy family HMM profile
- The last residue that matched the CAZy family HMM profile
- Gene start and end
- Coverage of the protein sequence by the HMM profiles

Each unique protein is put on a separate line.

Data to retrieve is the protein accession, the CAZy family and the domain range, but this will be easier to retrieve from the `overview.txt` file.

### The data in the Hotpep output file

The data is stored in the `Hotpep.out` file. It includes:
- The predicted CAZy family
- The number of the PPR family associated with the CAZy family
- Protein accession ("Gene ID")
- Frequency - ?
- The number of matched PPR k-mers
- The k-mers that were found in the query sequence
- Signature peptides - ?
- EC number

Data to retrieve includes the protein accession, predicted CAZy family and the EC number. The EC number can be retrieved here, but the CAZy family and accession may be easier to retrieve from the overview, because it appears that the other domains are not listed in the Hotpep output.

Each protein is on a separate line.

In [17]:
# Check how text is separated in overview.txt
with open ("overview.txt", "r") as ofh:
    overview_file = ofh.read().splitlines()

for line in overview_file[:5]:
    print(line)

for line in overview_file[:5]:
    print(repr(line))
# the data is separated by tabs `\t`

Gene ID	HMMER	Hotpep	DIAMOND	#ofTools
XP_001545265.2	GH53(20-346)	GH53(7)	GH53	3
XP_001545514.1	GT2_Chitin_synth_2(1213-1720)	GT2(161)	GT2	3
XP_001545605.1	AA4(28-577)	AA4(1)	AA4	3
XP_001545700.1	CBM21(353-459)	CBM21(4)	CBM21	3
'Gene ID\tHMMER\tHotpep\tDIAMOND\t#ofTools'
'XP_001545265.2\tGH53(20-346)\tGH53(7)\tGH53\t3'
'XP_001545514.1\tGT2_Chitin_synth_2(1213-1720)\tGT2(161)\tGT2\t3'
'XP_001545605.1\tAA4(28-577)\tAA4(1)\tAA4\t3'
'XP_001545700.1\tCBM21(353-459)\tCBM21(4)\tCBM21\t3'


In [18]:
# Check if pandas can easily handle .out files
diamond_df = pd.read_csv("diamond.out")
diamond_df

Unnamed: 0,Gene ID\tCAZy ID\t% Identical\tLength\tMismatches\tGap Open\tGene Start\tGene End\tCAZy Start\tCAZy End\tE Value\tBit Score
0,XP_001545605.1\tXP_001545605.1|AA4\t100.0\t590...
1,XP_024545922.1\tCCD55608.1|GT2\t100.0\t392\t0\...
2,XP_001547235.1\tATZ45248.1|AA3_1\t100.0\t572\t...
3,XP_001547254.1\tATZ45267.1|AA1_3\t100.0\t710\t...
4,XP_024546029.1\tATZ45383.1|GH18\t100.0\t1198\t...
...,...
594,XP_024553945.1\tATZ58688.1|GH1\t100.0\t576\t0\...
595,XP_001547801.1\tATZ58699.1|GH16\t100.0\t315\t0...
596,XP_024553962.1\tATZ58714.1|GH47\t100.0\t568\t0...
597,XP_024553961.1\tCCD55952.1|GH47\t100.0\t573\t0...


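With its default comma separator, pandas cannot split the output file into columns because the data is separated by tabs (`\t`). Passing `sep="\t"` to `pd.read_csv` would address this; the caution in the pandas documentation applies to regular-expression separators, which force the slower Python parsing engine, rather than to a literal tab.

However, to stay consistent with the parsers written for CUPP and eCAMI, all the dbCAN output files are parsed line by line in a similar manner.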
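
For comparison, a literal tab is a plain-string separator rather than a regular expression, so `pd.read_csv` can take it through its `sep` argument. The sketch below uses an in-memory sample copied from the first rows of `diamond.out`, so it does not depend on the file being present; the names `sample` and `diamond_preview` are illustrative only.

```python
import io

import pandas as pd

# Header and first two data rows of diamond.out, tab-separated
sample = (
    "Gene ID\tCAZy ID\t% Identical\tLength\tMismatches\tGap Open\t"
    "Gene Start\tGene End\tCAZy Start\tCAZy End\tE Value\tBit Score\n"
    "XP_001545605.1\tXP_001545605.1|AA4\t100.0\t590\t0\t0\t1\t590\t1\t590\t0.0e+00\t1215.7\n"
    "XP_024545922.1\tCCD55608.1|GT2\t100.0\t392\t0\t0\t1\t392\t1\t392\t2.2e-222\t773.9\n"
)

# A single-character separator keeps pandas on its default C parsing engine
diamond_preview = pd.read_csv(io.StringIO(sample), sep="\t")
print(diamond_preview.shape)
print(list(diamond_preview.columns[:3]))
```

Reading the real file this way would give numeric columns directly, whereas the line-by-line approach used in this notebook keeps every value as a string until it is converted explicitly.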

Start by parsing the data from the overview.txt file to retrieve all the predicted domains for each protein.

I also need to determine what is meant by CAZy start and CAZy end in the output from DIAMOND. Judging by the run_dbcan.py script, these values are taken straight from the output of DIAMOND.

In [19]:
with open("diamond.out", "r") as fh:
    diamond = fh.read().splitlines()

for line in diamond[:5]:
    line = line.split("\t")
    print(line)

['Gene ID', 'CAZy ID', '% Identical', 'Length', 'Mismatches', 'Gap Open', 'Gene Start', 'Gene End', 'CAZy Start', 'CAZy End', 'E Value', 'Bit Score']
['XP_001545605.1', 'XP_001545605.1|AA4', '100.0', '590', '0', '0', '1', '590', '1', '590', '0.0e+00', '1215.7']
['XP_024545922.1', 'CCD55608.1|GT2', '100.0', '392', '0', '0', '1', '392', '1', '392', '2.2e-222', '773.9']
['XP_001547235.1', 'ATZ45248.1|AA3_1', '100.0', '572', '0', '0', '1', '572', '1', '572', '0.0e+00', '1135.2']
['XP_001547254.1', 'ATZ45267.1|AA1_3', '100.0', '710', '0', '0', '1', '710', '1', '710', '0.0e+00', '1236.9']


In [20]:
for line in diamond[:5]:
    line = line.split("\t")
    print("length=", line[3])
    print("cazy start=", line[8])  # 'CAZy Start' is the ninth column
    print("cazy end=", line[9])  # 'CAZy End' is the tenth column

length= Length
cazy start= CAZy Start
cazy end= CAZy End
length= 590
cazy start= 1
cazy end= 590
length= 392
cazy start= 1
cazy end= 392
length= 572
cazy start= 1
cazy end= 572
length= 710
cazy start= 1
cazy end= 710


In [21]:
# the values appear to predict where the protein starts and ends
# this may be a feature for when a DNA sequence, and not an AA sequence, is given
# check if the CAZy end != length

# first build a df to help with visualisation
df_data = []
for line in diamond[1:]:
    line = line.split("\t")
    df_data.append(line)
    
diamond_df = pd.DataFrame(df_data, columns=['Gene ID', 'CAZy ID', '% Identical', 'Length', 'Mismatches', 'Gap Open', 'Gene Start', 'Gene End', 'CAZy Start', 'CAZy End', 'E Value', 'Bit Score'])
diamond_df


Unnamed: 0,Gene ID,CAZy ID,% Identical,Length,Mismatches,Gap Open,Gene Start,Gene End,CAZy Start,CAZy End,E Value,Bit Score
0,XP_001545605.1,XP_001545605.1|AA4,100.0,590,0,0,1,590,1,590,0.0e+00,1215.7
1,XP_024545922.1,CCD55608.1|GT2,100.0,392,0,0,1,392,1,392,2.2e-222,773.9
2,XP_001547235.1,ATZ45248.1|AA3_1,100.0,572,0,0,1,572,1,572,0.0e+00,1135.2
3,XP_001547254.1,ATZ45267.1|AA1_3,100.0,710,0,0,1,710,1,710,0.0e+00,1236.9
4,XP_024546029.1,ATZ45383.1|GH18,100.0,1198,0,0,1,1198,1,1198,0.0e+00,1496.1
...,...,...,...,...,...,...,...,...,...,...,...,...
594,XP_024553945.1,ATZ58688.1|GH1,100.0,576,0,0,1,576,1,576,0.0e+00,1191.0
595,XP_001547801.1,ATZ58699.1|GH16,100.0,315,0,0,1,315,1,315,2.0e-181,637.5
596,XP_024553962.1,ATZ58714.1|GH47,100.0,568,0,0,1,568,1,568,0.0e+00,1171.8
597,XP_024553961.1,CCD55952.1|GH47,100.0,573,0,0,1,573,1,573,0.0e+00,1181.4


In [22]:
# check if CAZy end ever differs from length
for index in range(len(diamond_df["Gene ID"])):
    if int(diamond_df["Length"].iloc[index]) != int(diamond_df["CAZy End"].iloc[index]):
        print(diamond_df.iloc[index], "\n")

Gene ID         XP_024546124.1
CAZy ID        EAA31326.1|GH71
% Identical               72.7
Length                     600
Mismatches                 152
Gap Open                     4
Gene Start                  94
Gene End                   688
CAZy Start                 623
CAZy End                  1215
E Value               5.6e-253
Bit Score                876.3
Name: 9, dtype: object 

Gene ID         XP_024546154.1
CAZy ID        AWP04987.1|GT13
% Identical               42.6
Length                     967
Mismatches                 522
Gap Open                    14
Gene Start                 119
Gene End                  1065
CAZy Start                  80
CAZy End                  1033
E Value               6.4e-219
Bit Score                763.8
Name: 14, dtype: object 

Gene ID               XP_001547281.1
CAZy ID        EAA64118.1|CBM20|GH15
% Identical                     47.3
Length                           579
Mismatches                       261
Gap Open            

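Using XP_024546124.1 as an example: the protein sequence is 695 residues long, so what is meant by the CAZy start and end being 623 and 1215 respectively? A position of 1215 lies beyond the end of the query protein, so these values cannot be coordinates within the query sequence; presumably they refer to positions in the matched CAZy database sequence (here EAA31326.1), although this has not been confirmed.

Therefore, the CAZy start and end do not appear to be informative when protein sequences are the input sequence type for dbCAN.

Consequently, data for DIAMOND need only be retrieved from the overview.txt file.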
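
A quick arithmetic check on the row for XP_024546124.1 makes the comparison concrete. The values are copied from the output above and the 695-residue length is the one quoted in the text; the dictionary and variable names are illustrative.

```python
# Values copied from the diamond.out row for XP_024546124.1
row = {
    "Length": 600,
    "Gene Start": 94,
    "Gene End": 688,
    "CAZy Start": 623,
    "CAZy End": 1215,
}
query_length = 695  # length of the query protein, as stated in the text

# Number of residues covered by each pair of coordinates (inclusive)
gene_span = row["Gene End"] - row["Gene Start"] + 1
cazy_span = row["CAZy End"] - row["CAZy Start"] + 1

print("gene span:", gene_span)
print("CAZy span:", cazy_span)
print("gene coordinates fit in query:", row["Gene End"] <= query_length)
print("CAZy coordinates fit in query:", row["CAZy End"] <= query_length)
```

The two spans are of similar size (595 and 593 residues), but only the Gene coordinates fall inside the 695-residue query protein.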

**HMMER**

First, retrieve the output from HMMER. The desired data is all predicted CAZy families/domains and their respective domain ranges.

The text in the overview.txt file is separated by tabs (`\t`), which allows the data for each prediction tool to be easily separated out. The data is presented in the following order:
0. Gene ID (protein name)
1. HMMER
2. Hotpep
3. DIAMOND
4. #ofTools (number of tools that predicted that the protein sequence contains at least one CAZy domain)


In [23]:
# Split each line of overview.txt on tabs
with open("overview.txt", "r") as ofh:
    overview_file = ofh.read().splitlines()

data_ov = []
for line in overview_file[1:5]:  # skip the first line because it holds the column titles
    line = line.split("\t")
    print(line)
    data_ov.append(line)

# or better presented as a df
df = pd.DataFrame(data_ov, columns=["Gene ID", "HMMER", "Hotpep", "DIAMOND", "#ofTools"])
df

['XP_001545265.2', 'GH53(20-346)', 'GH53(7)', 'GH53', '3']
['XP_001545514.1', 'GT2_Chitin_synth_2(1213-1720)', 'GT2(161)', 'GT2', '3']
['XP_001545605.1', 'AA4(28-577)', 'AA4(1)', 'AA4', '3']
['XP_001545700.1', 'CBM21(353-459)', 'CBM21(4)', 'CBM21', '3']


Unnamed: 0,Gene ID,HMMER,Hotpep,DIAMOND,#ofTools
0,XP_001545265.2,GH53(20-346),GH53(7),GH53,3
1,XP_001545514.1,GT2_Chitin_synth_2(1213-1720),GT2(161),GT2,3
2,XP_001545605.1,AA4(28-577),AA4(1),AA4,3
3,XP_001545700.1,CBM21(353-459),CBM21(4),CBM21,3


When multiple domains are predicted by HMMER, the domains are separated by '+' and each CAZy family is immediately followed by its domain range within parentheses.

Sometimes proteins predicted to contain GT2 domains are given the CAZy family "GT2_Chitin_synth_2". I am not sure why this is, but there is no GT2_2 subfamily. Therefore, the names of the predicted CAZy families need to be standardised to GT2.

Additionally, when no CAZy domains are predicted to be within the protein sequence, a result of '-' is provided. It will be better for the final dataframe if this is converted to a true null value.

Lastly, HMMER sometimes predicts the CAZy subfamily. Unlike with eCAMI and CUPP, the subfamily is not written to a separate column, although a separate column is helpful when it comes to evaluating the tools' performances. Therefore, when HMMER predicts the CAZy subfamily, the result needs to be parsed to retrieve the CAZy family, and the CAZy family and subfamily then stored separately.
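
One possible way to apply these rules in a single helper is sketched here. It assumes that a genuine CAZy subfamily is always written as the family followed by an underscore and a number (for example `GH13_31`), so any other suffix (such as `GT2_Chitin_synth_2`) is treated as a descriptive HMM name and reduced to the family. The function name `split_family_subfamily` and the regular expression are illustrative and not part of dbCAN.

```python
import re

import numpy as np

# Family = letters then digits (e.g. GH13); optional numeric subfamily suffix (e.g. _31)
FAMILY_PATTERN = re.compile(r"^(?P<family>[A-Za-z]+\d+)(?:_(?P<sub>\d+))?$")


def split_family_subfamily(name):
    """Return (family, subfamily) for one domain name taken from overview.txt."""
    if name == "-":  # placeholder used when no domain was predicted
        return np.nan, np.nan
    match = FAMILY_PATTERN.match(name)
    if match:
        family = match.group("family")
        # keep the full name as the subfamily only when a numeric suffix is present
        subfamily = name if match.group("sub") else np.nan
        return family, subfamily
    # descriptive names such as GT2_Chitin_synth_2: keep only the family part
    return name.split("_")[0], np.nan


for name in ["GH53", "GH13_31", "GT2_Chitin_synth_2", "-"]:
    print(name, "->", split_family_subfamily(name))
```

This is stricter than the cell that follows, which treats any name containing an underscore (other than those starting with `GT2_Chitin_`) as a subfamily.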

In [24]:
for line in overview_file[1:]:  # skip the first line because it holds the column titles
    line = line.split("\t")

    protein_id = line[0]

    # Retrieve predictions from HMMER (predicted domains and domain ranges)
    # separate out each of the domain predictions
    hmmer_prediction = line[1].split("+")

    # create empty lists to store predicted domains and ranges
    cazy_fam = []
    cazy_subfam = []
    domain_ranges = []

    # separate out the name of the predicted domain (the CAZy family) and the domain's AA range
    for domain in hmmer_prediction:
        domain = domain.split("(")
        # standardise name if necessary, retrieve subfam if predicted, and produce null values
        domain_name = domain[0]
        if domain_name.startswith("GT2_Chitin_"):
            cazy_fam.append("GT2")
            cazy_subfam.append(np.nan)

        elif domain_name == '-':
            cazy_fam.append(np.nan)
            cazy_subfam.append(np.nan)

        elif domain_name.find("_") != -1:
            cutoff = domain_name.find("_")
            cazy_fam.append(domain_name[:cutoff])
            cazy_subfam.append(domain_name)

        else:
            cazy_fam.append(domain_name)
            cazy_subfam.append(np.nan)

        try:
            if domain[1] == "-":
                domain_ranges.append(np.nan)
            else:
                domain_ranges.append(domain[1][:-1])  # exclude the final ")"
        except IndexError:  # no domain predicted so no domain ranges
            domain_ranges.append(np.nan)

    print("hmmer_prediction=", hmmer_prediction, "familes=", cazy_fam, "subfams", cazy_subfam, "domain_ranges=", domain_ranges)


hmmer_prediction= ['GH53(20-346)'] familes= ['GH53'] subfams [nan] domain_ranges= ['20-346']
hmmer_prediction= ['GT2_Chitin_synth_2(1213-1720)'] familes= ['GT2'] subfams [nan] domain_ranges= ['1213-1720']
hmmer_prediction= ['AA4(28-577)'] familes= ['AA4'] subfams [nan] domain_ranges= ['28-577']
hmmer_prediction= ['CBM21(353-459)'] familes= ['CBM21'] subfams [nan] domain_ranges= ['353-459']
hmmer_prediction= ['GH132(57-318)'] familes= ['GH132'] subfams [nan] domain_ranges= ['57-318']
hmmer_prediction= ['GT32(85-165)'] familes= ['GT32'] subfams [nan] domain_ranges= ['85-165']
hmmer_prediction= ['GT25(58-285)'] familes= ['GT25'] subfams [nan] domain_ranges= ['58-285']
hmmer_prediction= ['GH17(442-672)'] familes= ['GH17'] subfams [nan] domain_ranges= ['442-672']
hmmer_prediction= ['GT71(136-358)'] familes= ['GT71'] subfams [nan] domain_ranges= ['136-358']
hmmer_prediction= ['GH16(87-284)'] familes= ['GH16'] subfams [nan] domain_ranges= ['87-284']
hmmer_prediction= ['AA2(84-274)'] familes= 

hmmer_prediction= ['GH1(166-661)'] familes= ['GH1'] subfams [nan] domain_ranges= ['166-661']
hmmer_prediction= ['GH17(24-303)'] familes= ['GH17'] subfams [nan] domain_ranges= ['24-303']
hmmer_prediction= ['GH51(182-653)'] familes= ['GH51'] subfams [nan] domain_ranges= ['182-653']
hmmer_prediction= ['GT90(349-621)'] familes= ['GT90'] subfams [nan] domain_ranges= ['349-621']
hmmer_prediction= ['GH17(82-309)'] familes= ['GH17'] subfams [nan] domain_ranges= ['82-309']
hmmer_prediction= ['GH28(21-361)'] familes= ['GH28'] subfams [nan] domain_ranges= ['21-361']
hmmer_prediction= ['GH13_31(38-394)'] familes= ['GH13'] subfams ['GH13_31'] domain_ranges= ['38-394']
hmmer_prediction= ['GT24(1186-1433)'] familes= ['GT24'] subfams [nan] domain_ranges= ['1186-1433']
hmmer_prediction= ['GH6(125-418)'] familes= ['GH6'] subfams [nan] domain_ranges= ['125-418']
hmmer_prediction= ['GH5_5(89-384)'] familes= ['GH5'] subfams ['GH5_5'] domain_ranges= ['89-384']
hmmer_prediction= ['PL7_4(71-286)'] familes= ['

hmmer_prediction= ['GH5_9(83-382)'] familes= ['GH5'] subfams ['GH5_9'] domain_ranges= ['83-382']
hmmer_prediction= ['AA3_2(49-672)'] familes= ['AA3'] subfams ['AA3_2'] domain_ranges= ['49-672']
hmmer_prediction= ['GH12(116-258)'] familes= ['GH12'] subfams [nan] domain_ranges= ['116-258']
hmmer_prediction= ['GH55(52-808)'] familes= ['GH55'] subfams [nan] domain_ranges= ['52-808']
hmmer_prediction= ['GH135(7-235)'] familes= ['GH135'] subfams [nan] domain_ranges= ['7-235']
hmmer_prediction= ['GT17(70-378)'] familes= ['GT17'] subfams [nan] domain_ranges= ['70-378']
hmmer_prediction= ['GH79(41-414)'] familes= ['GH79'] subfams [nan] domain_ranges= ['41-414']
hmmer_prediction= ['GH92(288-793)'] familes= ['GH92'] subfams [nan] domain_ranges= ['288-793']
hmmer_prediction= ['GH12(100-246)'] familes= ['GH12'] subfams [nan] domain_ranges= ['100-246']
hmmer_prediction= ['GT2_Glyco_tranf_2_3(219-467)'] familes= ['GT2'] subfams ['GT2_Glyco_tranf_2_3'] domain_ranges= ['219-467']
hmmer_prediction= ['CE

hmmer_prediction= ['CE10(68-314)'] familes= ['CE10'] subfams [nan] domain_ranges= ['68-314']
hmmer_prediction= ['CE10(68-314)'] familes= ['CE10'] subfams [nan] domain_ranges= ['68-314']
hmmer_prediction= ['GH3(83-295)'] familes= ['GH3'] subfams [nan] domain_ranges= ['83-295']
hmmer_prediction= ['GH15(50-458)'] familes= ['GH15'] subfams [nan] domain_ranges= ['50-458']
hmmer_prediction= ['GH71(21-408)', 'CBM24(459-539)', 'CBM24(558-634)'] familes= ['GH71', 'CBM24', 'CBM24'] subfams [nan, nan, nan] domain_ranges= ['21-408', '459-539', '558-634']
hmmer_prediction= ['GT62(100-362)'] familes= ['GT62'] subfams [nan] domain_ranges= ['100-362']
hmmer_prediction= ['CE10(175-582)'] familes= ['CE10'] subfams [nan] domain_ranges= ['175-582']
hmmer_prediction= ['GH15(81-493)', 'CBM20(573-663)'] familes= ['GH15', 'CBM20'] subfams [nan, nan] domain_ranges= ['81-493', '573-663']
hmmer_prediction= ['GH28(183-502)'] familes= ['GH28'] subfams [nan] domain_ranges= ['183-502']
hmmer_prediction= ['GH35(35-36

hmmer_prediction= ['GH12(105-251)'] familes= ['GH12'] subfams [nan] domain_ranges= ['105-251']
hmmer_prediction= ['GH72(20-335)'] familes= ['GH72'] subfams [nan] domain_ranges= ['20-335']
hmmer_prediction= ['AA7(3-335)'] familes= ['AA7'] subfams [nan] domain_ranges= ['3-335']
hmmer_prediction= ['GH28(64-373)'] familes= ['GH28'] subfams [nan] domain_ranges= ['64-373']
hmmer_prediction= ['CE10(137-433)'] familes= ['CE10'] subfams [nan] domain_ranges= ['137-433']
hmmer_prediction= ['CE10(137-426)'] familes= ['CE10'] subfams [nan] domain_ranges= ['137-426']
hmmer_prediction= ['GH5_9(473-796)'] familes= ['GH5'] subfams ['GH5_9'] domain_ranges= ['473-796']
hmmer_prediction= ['CE10(120-380)'] familes= ['CE10'] subfams [nan] domain_ranges= ['120-380']
hmmer_prediction= ['GH76(23-386)'] familes= ['GH76'] subfams [nan] domain_ranges= ['23-386']
hmmer_prediction= ['GT4(221-389)'] familes= ['GT4'] subfams [nan] domain_ranges= ['221-389']
hmmer_prediction= ['GH78(419-833)'] familes= ['GH78'] subfam

hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan] subfams [nan] domain_ranges= [nan]
hmmer_prediction= ['-'] familes= [nan]

**HOTPEP**

Next, retrieve the data from the Hotpep predictions. In the overview.txt file this data is laid out in the same manner as for HMMER. However, instead of domain ranges, the parentheses contain the number of the CAZy family's associated k-mer cluster/group within Hotpep, which is not needed.

In [25]:
for line in overview_file[1:15]:  # skip the first line because it holds the column titles
    line = line.split("\t")

    # Retrieve predictions from Hotpep
    hotpep_prediction = line[2].split("+")
    # separate the predicted CAZy fam and subfam
    cazy_fams = []
    cazy_subfams = []
    # remove the k-mer cluster number
    for prediction in hotpep_prediction:
        prediction = prediction.split("(")[0]
        if prediction == "-":
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)
        elif prediction.find("_") != -1:
            cutoff = prediction.find("_")
            cazy_fams.append(prediction[:cutoff])
            cazy_subfams.append(prediction)
        else:
            cazy_fams.append(prediction)
            cazy_subfams.append(np.nan)
    print("hotpep_prediction=", hotpep_prediction, "familes=", cazy_fams, "subfams", cazy_subfams)

# checked more rows to make sure that the handling of '-' (no domains predicted) is ok
# there are no subfamily predictions in this output file, but Hotpep can predict CAZy subfamilies

hotpep_prediction= ['GH53(7)'] familes= ['GH53'] subfams [nan]
hotpep_prediction= ['GT2(161)'] familes= ['GT2'] subfams [nan]
hotpep_prediction= ['AA4(1)'] familes= ['AA4'] subfams [nan]
hotpep_prediction= ['CBM21(4)'] familes= ['CBM21'] subfams [nan]
hotpep_prediction= ['GH132(2)'] familes= ['GH132'] subfams [nan]
hotpep_prediction= ['GT32(54)'] familes= ['GT32'] subfams [nan]
hotpep_prediction= ['GT25(35)'] familes= ['GT25'] subfams [nan]
hotpep_prediction= ['GH17(29)'] familes= ['GH17'] subfams [nan]
hotpep_prediction= ['GT71(7)'] familes= ['GT71'] subfams [nan]
hotpep_prediction= ['-'] familes= [nan] subfams [nan]
hotpep_prediction= ['-'] familes= [nan] subfams [nan]
hotpep_prediction= ['GH18(92)', 'CBM1(13)', 'CBM19(1)'] familes= ['GH18', 'CBM1', 'CBM19'] subfams [nan, nan, nan]
hotpep_prediction= ['-'] familes= [nan] subfams [nan]
hotpep_prediction= ['-'] familes= [nan] subfams [nan]


The predicted EC number from Hotpep would also be good to have, but this has to be retrieved from the specific Hotpep output file. Therefore, to keep the focus on the overview.txt file, let's look at retrieving the output from DIAMOND.

**DIAMOND**

DIAMOND can also predict multiple domains, and these are separated by a '+'. No additional data is given within the overview.txt file, and none of the additional data within the DIAMOND output file is wanted. Therefore, a similar approach to the previous two can be used for retrieving the output for DIAMOND.

Again, if no CAZy domains are predicted for a query protein, a '-' is given, which will be converted to a null value.

Lastly, DIAMOND can predict CAZy subfamilies and, like HMMER, it does not produce a separate CAZy family and subfamily output. Therefore, one needs to be created manually.

In [26]:
for line in overview_file[1:25]:  # skip the first line because it holds the column titles
    line = line.split("\t")

    # Retrieve domain predictions from DIAMOND
    diamond_prediction = line[3].split('+')
    cazy_fams = []
    cazy_subfams = []
    for pred in diamond_prediction:
        if pred == '-':
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)
        elif pred.find("_") != -1:
            cutoff = pred.find("_")
            cazy_fams.append(pred[:cutoff])
            cazy_subfams.append(pred)
        else:
            cazy_fams.append(pred)
            cazy_subfams.append(np.nan)
    print("diamond_predictions=", diamond_prediction, "fams=", cazy_fams, "subfams=", cazy_subfams)
# extended the number of lines in the trial to make sure subfamily capture worked

diamond_predictions= ['GH53'] fams= ['GH53'] subfams= [nan]
diamond_predictions= ['GT2'] fams= ['GT2'] subfams= [nan]
diamond_predictions= ['AA4'] fams= ['AA4'] subfams= [nan]
diamond_predictions= ['CBM21'] fams= ['CBM21'] subfams= [nan]
diamond_predictions= ['GH132'] fams= ['GH132'] subfams= [nan]
diamond_predictions= ['GT32'] fams= ['GT32'] subfams= [nan]
diamond_predictions= ['GT25'] fams= ['GT25'] subfams= [nan]
diamond_predictions= ['GH17'] fams= ['GH17'] subfams= [nan]
diamond_predictions= ['GT71'] fams= ['GT71'] subfams= [nan]
diamond_predictions= ['GH16'] fams= ['GH16'] subfams= [nan]
diamond_predictions= ['-'] fams= [nan] subfams= [nan]
diamond_predictions= ['CBM1', 'GH18'] fams= ['CBM1', 'GH18'] subfams= [nan, nan]
diamond_predictions= ['-'] fams= [nan] subfams= [nan]
diamond_predictions= ['-'] fams= [nan] subfams= [nan]
diamond_predictions= ['CBM1', 'GH131'] fams= ['CBM1', 'GH131'] subfams= [nan, nan]
diamond_predictions= ['-'] fams= [nan] subfams= [nan]
diamond_predictions=

**HOTPEP AGAIN**
It could be useful to retrieve the predicted EC number from Hotpep.out (the output file for Hotpep).

Like the overview.txt file, each protein is on a separate line and the data is separated by tabs (`\t`). The EC numbers are stored at the end of each line.

In [27]:
with open("Hotpep.out", "r") as hf:
    hotpep_file = hf.read().splitlines()

df_data = []
column_names = hotpep_file[0].split("\t")
for line in hotpep_file[1:5]:
    line = line.split("\t")
    df_data.append(line)
    print(line)

df = pd.DataFrame(df_data, columns=column_names)
df

['CE0', '9', 'XP_001551543.2', '23.2', '48', 'HPTDVG,GHIKVA,VGHIKV,YKGRWD,DSKYVS,VAITFG,IKVASH,DIVVIN,VSWWSA,VLVGYR,KGRWDS,DTLYWN,SYKGRW,AGLGNT,GRWDSK,SWWSAP,DVGHIK,WWSAPG,GLGNTE,SKYVSW,THLLVS,HNDINP,ADIVVI,GNTEYS,KVASHL,NLGTND,HLLVSP,ATHLLV,RWDSKY,YSITAY,RVTNWA,TFELRV,YVSWWS,EYSITA,KYVSWW,TLYWND,YFNTTG,PTDVGH,LGNTEY,ITAYPG,GAGLGN,WDSKYV,YTSDTS,TDVGHI,HIKVAS,SITAYP,LLVSPE,IGDSLS', '3.1.1.-:17']
['CE0', '16', 'XP_024553067.1', '52.7', '70', 'LGTSAG,PNWVEY,VEYLTS,LFMNLP,GGPNWV,FMNLPP,TSAGGP,QNYSFI,MDPSKT,GTSAGG,VQNQLG,DTDSYT,PPLQRT,AGGPNW,HNYTVS,AFGDSY,QNQLGT,DSYTFP,SLENQI,NWVEYL,GDTDSY,QLWDFA,WDFAFA,NYSFIG,DFAFAG,GINDIG,NFSYTP,FPSHNS,VAIWIG,AAIETI,NLPPLQ,FGDSYT,WVEYLT,TFPSHN,NQLGTS,LHHNYT,GLPSKC,IGINDI,LTSCFS,INDIGD,LWDFAF,GDSYTY,NDIGDT,AGSDVS,YSFIGD,AFAGSD,PSKTLV,AIWIGI,MNLPPL,GPNWVE,KQLWDF,FAFAGS,DSYTYV,LVAIWI,YTFPSH,FAAIET,IWIGIN,FSYTPA,SGLPSK,SFIGDL,GSDVST,SYTFPS,SAGGPN,LYTNII,QLGTSA,TSCFSG,DIGDTD,WIGIND,AIETIY,SYTYVQ', 'NA']
['CE0', '16', 'XP_024553068.1', '52.7', '70', 'LGTSAG,PNW

Unnamed: 0,CAZy Family,PPR Subfamily,Gene ID,Frequency,Hits,Signature Peptides,EC number
0,CE0,9,XP_001551543.2,23.2,48,"HPTDVG,GHIKVA,VGHIKV,YKGRWD,DSKYVS,VAITFG,IKVA...",3.1.1.-:17
1,CE0,16,XP_024553067.1,52.7,70,"LGTSAG,PNWVEY,VEYLTS,LFMNLP,GGPNWV,FMNLPP,TSAG...",
2,CE0,16,XP_024553068.1,52.7,70,"LGTSAG,PNWVEY,VEYLTS,LFMNLP,GGPNWV,FMNLPP,TSAG...",
3,CE1,14,XP_024553208.1,6.13,10,"AMMTNV,TTLYPQ,DTTLYP,GTSSGA,SGAMMT,TLYPQN,GAMM...","3.1.1.72:347, 3.1.1.-:53"


Looking at the output, the Gene ID is in the third column (`line[2]`) and the EC number in the last (`line[-1]`). When no EC number is predicted the string "NA" is given, but this should be changed to a null value.

Additionally, the EC number needs formatting, because each predicted EC number is immediately followed by a `:score` suffix (for example `3.1.1.72:347`), which is not wanted.

Arguably, it would be easier to retrieve all the data from the Hotpep.out file, but that file only includes the highest scoring CAZy family and does not show the multi-domain predictions.

Therefore, we need to retrieve the EC number and the "Gene ID" (protein name/id) of the associated protein, and make sure this data is linked.

In [28]:
for line in hotpep_file[1:]:
    line = line.split("\t")
    print(line[0])
    

CE0
CE0
CE0
CE1
CE12
CE12
CE12
CE16
CE16
CE16
CE16
CE16
CE2
CE4
CE4
CE4
CE4
CE5
CE5
CE5
CE5
CE5
CE5
CE5
CE5
CE5
CE5
CE8
CE8
CE8
CE8
CE9
CE9
GH0
GH0
GH0
GH0
GH0
GH0
GH0
GH1
GH1
GH1
GH10
GH10
GH105
GH105
GH106
GH106
GH11
GH11
GH11
GH114
GH115
GH12
GH12
GH12
GH125
GH125
GH125
GH127
GH128
GH128
GH13
GH13
GH13
GH13
GH131
GH131
GH131
GH132
GH132
GH132
GH133
GH135
GH135
GH145
GH15
GH15
GH15
GH15
GH15
GH15
GH152
GH154
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH16
GH17
GH17
GH17
GH17
GH17
GH18
GH18
GH18
GH18
GH18
GH18
GH2
GH2
GH20
GH20
GH26
GH27
GH27
GH27
GH27
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH28
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH3
GH31
GH31
GH31
GH31
GH31
GH32
GH32
GH35
GH35
GH35
GH35
GH35
GH36
GH36
GH37
GH38
GH38
GH39
GH43
GH43
GH43
GH43
GH43
GH43
GH43
GH43
GH45
GH45
GH47
GH47
GH47
GH47
GH47
GH47
GH47
GH47
GH47
GH47
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH5
GH51
GH51

In [29]:
hotpep_ec_predictions = {}
for line in hotpep_file[1:5]:
    line = line.split("\t")
    ec_numbers = line[-1]
    ec_numbers = ec_numbers.split(",")  # multiple EC numbers may be predicted
    # format the EC numbers
    for index in range(len(ec_numbers)):
        # remove the ":score" suffix and any whitespace left over from the ", " separator
        ec = ec_numbers[index].split(":")[0].strip()
        if ec == "NA":  # change the placeholder null value to a true null value
            ec_numbers[index] = np.nan
        else:
            ec_numbers[index] = ec
    gene_id = line[2]
    hotpep_ec_predictions[gene_id] = ec_numbers

hotpep_ec_predictions

{'XP_001551543.2': ['3.1.1.-'],
 'XP_024553067.1': [nan],
 'XP_024553068.1': [nan],
 'XP_024553208.1': ['3.1.1.72', '3.1.1.-']}

I do not want to rely on the proteins being presented in the Hotpep.out file in exactly the same order as in the overview.txt file, hence the importance of associating the predicted EC numbers with the gene ID. These gene IDs can be used to make sure that the correct EC number predictions are given to the correct proteins. To do this, a single dataframe containing the Hotpep data is needed, with a column (either empty or pre-populated with null values) into which the predicted EC numbers can be inserted.

Therefore, the next task is to build one dataframe each for HMMER, DIAMOND and Hotpep from the overview.txt file.

What needs to be considered at this stage is that eCAMI reports the main (or highest scoring) CAZy prediction followed by additional predictions. However, the run_dbcan.py script shows that the predicted CAZy families are taken 'as is' straight from the prediction tool output for HMMER, DIAMOND and Hotpep, meaning they do not appear to be ordered with the highest scoring family listed first. NOTE: need to check the papers that discuss the tools individually.
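
A minimal sketch of that linking step, using toy data in place of the real dataframes. The `ec_number` column name and the `unmatched_protein` accession are assumptions made for illustration, and the dictionary mirrors the structure of `hotpep_ec_predictions` built above. Each accession is looked up by key, so the row order of the two files does not matter.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Hotpep dataframe built from overview.txt
hotpep_df = pd.DataFrame({
    "protein_accession": ["XP_024553208.1", "XP_001551543.2", "unmatched_protein"],
    "cazy_family": [["CE1"], ["CE0"], [np.nan]],
})

# Toy stand-in for hotpep_ec_predictions (gene ID -> list of EC numbers)
ec_predictions = {
    "XP_001551543.2": ["3.1.1.-"],
    "XP_024553208.1": ["3.1.1.72", "3.1.1.-"],
}

# Look up each accession by key; accessions absent from the dictionary get NaN
hotpep_df["ec_number"] = hotpep_df["protein_accession"].map(
    lambda accession: ec_predictions.get(accession, np.nan)
)
print(hotpep_df[["protein_accession", "ec_number"]])
```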


In [30]:
def parse_dbcan_output(overview_file_path):
    """Parse the run_dbCAN overview.txt file, building one dataframe per prediction tool.

    :param overview_file_path: path, path to the output 'overview.txt' file

    Return 3 dataframes, one each for HMMER, Hotpep and DIAMOND
    """
    with open(overview_file_path, "r") as fh:
        overview_file = fh.read().splitlines()

    # collect the single-row dataframes produced for each protein, one list per tool
    hmmer_rows = []
    hotpep_rows = []  # EC numbers are added later
    diamond_rows = []

    for line in overview_file[1:]:  # skip the first line because it holds the column titles
        line = line.split("\t")

        # retrieve the data for HMMER, Hotpep and DIAMOND
        hmmer_rows.append(parse_hmmer_output(line))
        hotpep_rows.append(parse_hotpep_output(line))
        diamond_rows.append(parse_diamond_output(line))

    # DataFrame.append was removed in pandas 2.0, so concatenate the rows once at the end
    hmmer_df = pd.concat(hmmer_rows, ignore_index=True)
    hotpep_df = pd.concat(hotpep_rows, ignore_index=True)
    diamond_df = pd.concat(diamond_rows, ignore_index=True)

    return hmmer_df, hotpep_df, diamond_df


def parse_hmmer_output(line):
    """Parse the output from HMMER from the overview.txt file.

    Retrieve the protein accession, predicted CAZy families, accompanying subfamilies and domain ranges.
    Multiple domains can be predicted, so the families, subfamilies and ranges are collected as
    lists that are aligned by index: the first item in every list describes the first predicted
    domain, the second item the second predicted domain, and so on.

    :param line: list of str, a line from the dbCAN overview.txt file split on tabs

    Return pandas dataframe."""
    # Retrieve predictions from HMMER (predicted domains and domain ranges)
    # separate out each of the domain predictions
    hmmer_prediction = line[1].split("+")

    # create empty lists to store predicted domains and ranges
    cazy_fam = []
    cazy_subfam = []
    domain_ranges = []

    # separate out the name of the predicted domain (the CAZy family) and the domain's AA range
    for domain in hmmer_prediction:
        domain = domain.split("(")
        # standardise name if necessary, retrieve subfam if predicted, and produce null values
        domain_name = domain[0]
        if domain_name.startswith("GT2_Chitin_"):
            cazy_fam.append("GT2")
            cazy_subfam.append(np.nan)

        elif domain_name == '-':
            cazy_fam.append(np.nan)
            cazy_subfam.append(np.nan)

        elif domain_name.find("_") != -1:
            cutoff = domain_name.find("_")
            cazy_fam.append(domain_name[:cutoff])
            cazy_subfam.append(domain_name)

        else:
            cazy_fam.append(domain_name)
            cazy_subfam.append(np.nan)

        try:
            if domain[1] == "-":
                domain_ranges.append(np.nan)
            else:
                domain_ranges.append(domain[1][:-1])  # exclude the final ")"
        except IndexError:  # no domain predicted so no domain ranges
            domain_ranges.append(np.nan)

        
    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fam],
        "cazy_subfamily": [cazy_subfam],
        "domain_range": [domain_ranges],
    }

    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def parse_hotpep_output(line):
    """Parse the output from Hotpep from the overview.txt file.
    
    Retrieve the protein accession, predicated CAZy families, accomanying subfamiles.
    Multiple domains can be predicated, and the families and subfamiles will be collected as
    as lists, with the index for each item in each list being associated with the matching items (by index)
    in the other lists. For example, the first item in every list will contain the prediciton for the first
    predicated domain, the second item the second prediction etc.
    
    :param line: str, line from the dbcan overview.txt file
    
    Return pandas dataframe."""
    # Retrieve predictions from Hotpep
    hotpep_prediction = line[2].split("+")

    # create empty lists to separate the predicated CAZy fam and subfam
    cazy_fams = []
    cazy_subfams = []

    # remove the k-mer cluster number
    for prediction in hotpep_prediction:
        prediction = prediction.split("(")[0]

        if prediction == "-":
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)

        elif prediction.find("_") != -1:
            cutoff = prediction.find("_")
            cazy_fams.append(prediction[:cutoff])
            cazy_subfams.append(prediction)

        else:
            cazy_fams.append(prediction)
            cazy_subfams.append(np.nan)

    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fams],
        "cazy_subfamily": [cazy_subfams],
    }
    
    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def parse_diamond_output(line):
    """Parse the output from DIAMOND from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families and accompanying subfamilies.
    Multiple domains can be predicted, and the families and subfamilies are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    
    # Retrieve domain predictions from DIAMOND
    diamond_prediction = line[3].split('+')

    # create empty lists to separate the CAZy family and subfamilies
    cazy_fams = []
    cazy_subfams = []

    for pred in diamond_prediction:
        if pred == '-':
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)

        elif pred.find("_") != -1:
            cutoff = pred.find("_")
            cazy_fams.append(pred[:cutoff])
            cazy_subfams.append(pred)

        else:
            cazy_fams.append(pred)
            cazy_subfams.append(np.nan)

    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fams],
        "cazy_subfamily": [cazy_subfams],
    }
    
    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


hmmer_df, hotpep_df, diamond_df = parse_dbcan_output("overview.txt")
print("done")

done


In [31]:
hmmer_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,domain_range
0,XP_001545265.2,[GH53],[nan],[20-346]
1,XP_001545514.1,[GT2],[nan],[1213-1720]
2,XP_001545605.1,[AA4],[nan],[28-577]
3,XP_001545700.1,[CBM21],[nan],[353-459]
4,XP_001545706.2,[GH132],[nan],[57-318]
...,...,...,...,...
701,XP_024553622.1,[nan],[nan],[nan]
702,XP_001546886.1,[nan],[nan],[nan]
703,XP_024553655.1,[nan],[nan],[nan]
704,XP_024553815.1,[nan],[nan],[nan]


In [32]:
hotpep_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
701,XP_024553622.1,[nan],[nan]
702,XP_001546886.1,[nan],[nan]
703,XP_024553655.1,[nan],[nan]
704,XP_024553815.1,[nan],[nan]


In [33]:
diamond_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
701,XP_024553622.1,[GH30],[GH30_7]
702,XP_001546886.1,[GH30],[GH30_7]
703,XP_024553655.1,"[CBM24, GH71]","[nan, nan]"
704,XP_024553815.1,[CBM13],[nan]


This produces one dataframe each for DIAMOND, HMMER and Hotpep. However, we also want to add the predicted EC numbers from Hotpep.out to the Hotpep dataframe.
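
Conceptually, this step is a lookup keyed on the protein accession. As a minimal sketch on made-up data (the `demo_df` and `ec_lookup` names and the `P1` to `P3` accessions are illustrative assumptions, not part of the Hotpep output), `Series.map` can attach values from a dictionary and leaves a null value where no entry exists:

```python
import pandas as pd

# toy stand-in for the Hotpep dataframe (hypothetical accessions)
demo_df = pd.DataFrame({"protein_accession": ["P1", "P2", "P3"]})

# toy stand-in for the EC numbers parsed from Hotpep.out; P2 has no prediction
ec_lookup = {"P1": "3.2.1.89", "P3": "2.4.1.16"}

# map each accession to its EC number; accessions missing from the dict become NaN
demo_df["ec_number"] = demo_df["protein_accession"].map(ec_lookup)
print(demo_df)
```

The function in the next cell does the same job with an explicit loop over the parsed file.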

In [34]:
# add a column full of null values for EC numbers
hotpep_df["ec_number"] = np.nan

# retrieve EC numbers from Hotpep.out and add to the corresponding protein in the Hotpep df


def add_hotpep_ec_predictions(hotpep_output_file, hotpep_df):
    """Retrieve predicted EC numbers from Hotpep output file and add to the Hotpep dataframe.
    
    :param hotpep_output_file: path, path to Hotpep.out file
    :param hotpep_df: pandas dataframe, containing Hotpep predicted CAZy families
    
    Return pandas dataframe (hotpep_df with EC number predictions)
    """
    with open(hotpep_output_file, "r") as fh:
        hotpep_file = fh.read().splitlines()
        
        ec_predictions = {}
        
        for line in hotpep_file[1:]:  # skip the first line which contains the titles
            line = line.split("\t")
            
            ec_numbers = line[-1].split(",") # multiple EC numbers may be predicted
            
            # remove (":score") from each EC number, and convert "NA" to proper null value
            index = 0
            for index in range(len(ec_numbers)):
                ec = ec_numbers[index].split(":")[0] # remove the (":score") from the EC number
                
                if ec == "NA":
                    ec_numbers[index] = np.nan
                
                else:
                    ec_numbers[index] = ec
            
            protein_accession = line[2]
            ec_predictions[protein_accession] = ec_numbers
        

    for protein_accession in ec_predictions:  # dict {protein_accession: [predicted EC numbers]}
        # skip proteins with no EC numbers to add ("NA" was converted to np.nan, which is a float)
        if isinstance(ec_predictions[protein_accession][0], float):
            continue
            
        # find index of row in hotpep_df with the same protein accession
        row_index = hotpep_df.index[hotpep_df["protein_accession"] == protein_accession].tolist()[0]
        # join the EC numbers into one comma-separated string and add it to this row
        string = ",".join(str(item) for item in ec_predictions[protein_accession])
        hotpep_df.iloc[row_index, 3] = string
    
    return hotpep_df

hotpep_df = add_hotpep_ec_predictions("Hotpep.out", hotpep_df)
# display the dataframe here to check the values were added correctly
hotpep_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,ec_number
0,XP_001545265.2,[GH53],[nan],3.2.1.89
1,XP_001545514.1,[GT2],[nan],2.4.1.16
2,XP_001545605.1,[AA4],[nan],1.1.3.38
3,XP_001545700.1,[CBM21],[nan],
4,XP_001545706.2,[GH132],[nan],
...,...,...,...,...
701,XP_024553622.1,[nan],[nan],
702,XP_001546886.1,[nan],[nan],
703,XP_024553655.1,[nan],[nan],
704,XP_024553815.1,[nan],[nan],


**dbCAN consensus result**

The final task is to create a consensus result for the dbCAN tool.

A consensus result is defined as a result on which 2 or more tools agree. For example, if the #ofTools count for a protein is 1, then the dbCAN consensus is that the query protein is not a CAZyme. That is the easy case to determine.

The difficulty arises when #ofTools >= 2 but the tools predicted different domains, or one tool predicted multiple domains and another only one. Resolving this involves checking whether each predicted domain is also listed in the other tools' predictions.
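
One way to picture this check is to count, for each predicted family, how many tools listed it. The sketch below is only an illustration under stated assumptions: `families_with_min_votes` is a hypothetical helper, not part of dbCAN, and the example predictions are made up.

```python
from collections import Counter


def families_with_min_votes(tool_predictions, min_votes=2):
    """Return the families predicted by at least `min_votes` tools (hypothetical helper)."""
    votes = Counter()
    for predictions in tool_predictions:
        # use a set so a tool that lists a family twice still only votes once;
        # keep strings only, so null values (floats) never count as agreement
        votes.update({fam for fam in predictions if isinstance(fam, str)})
    return sorted(fam for fam, count in votes.items() if count >= min_votes)


# made-up predictions: one tool reports two domains, the others one or two
hmmer = ["GH5", "CBM1"]
hotpep = ["GH5"]
diamond = ["CBM1", "GH71"]
print(families_with_min_votes([hmmer, hotpep, diamond]))

# two tools reporting no prediction do not form a consensus
print(families_with_min_votes([[float("nan")], [float("nan")], ["GH30"]]))
```

Null values are dropped before counting, so two tools that both report no prediction are never treated as agreeing on a domain.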

Again this involves parsing the overview.txt file. At the end this will be added to the parse_dbcan_output() function but for now it will be developed in isolation.

Another issue is the row order of the dataframes containing the standardised output from each tool. They should be in the same order, because we iterated through the lines of the overview file and added the data from each line to the HMMER, Hotpep and DIAMOND dataframes at the same time. Therefore, we could either use the finished dataframes to get the standardised results for each prediction tool and find the consensus, or build the dbCAN dataframe at the same time as the others. The former relies on the rows staying aligned, which is potentially brittle, so the latter approach is taken.
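
To illustrate why relying on row order is fragile, here is a small sketch on made-up data (the `hmmer_demo` and `hotpep_demo` dataframes and their accessions are assumptions for illustration). It checks whether two dataframes list the accessions in the same order, and shows that merging on the accession pairs rows by key rather than by position:

```python
import pandas as pd

# toy per-tool dataframes whose rows are deliberately in a different order
hmmer_demo = pd.DataFrame({
    "protein_accession": ["P1", "P2", "P3"],
    "cazy_family": [["GH5"], ["GT2"], ["AA4"]],
})
hotpep_demo = pd.DataFrame({
    "protein_accession": ["P1", "P3", "P2"],
    "cazy_family": [["GH5"], ["AA4"], ["GT2"]],
})

# comparing by row position is only safe if the accession columns are identical
same_order = hmmer_demo["protein_accession"].equals(hotpep_demo["protein_accession"])
print(same_order)

# merging on the accession pairs rows by key instead of by position
merged = hmmer_demo.merge(
    hotpep_demo,
    on="protein_accession",
    suffixes=("_hmmer", "_hotpep"),
    validate="one_to_one",  # raise if an accession is duplicated in either dataframe
)
print(merged)
```

Building the consensus while each line is parsed, as done below, avoids the need for either check.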


In [35]:
def parse_dbcan_output(overview_file_path):
    """Parse the output from the run_dbCAN overview.txt file, writing a dataframe for each prediction tool.
    
    :param overview_file_path: path, path to the output 'overview.txt' file
    
    Return 4 dataframes, one each for consensus dbCAN result, HMMER, Hotpep and DIAMOND
    """
    with open(overview_file_path, "r") as fh:
        overview_file = fh.read().splitlines()
        
    # build empty dataframes to add the respective prediction tools' output to
    hmmer_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily", "domain_range"])
    hotpep_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily"]) # add ec later
    diamond_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily"])
    dbcan_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family_3", "cazy_subfamily_3", "cazy_family_2", "cazy_subfamily_2"])

    for line in overview_file[1:]: # skip the first line because it contains the column titles
        line = line.split("\t")
        
        # retrieve the data for HMMER
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        hmmer_temp_df = parse_hmmer_output(line)
        hmmer_df = pd.concat([hmmer_df, hmmer_temp_df], ignore_index=True)
        
        # retrieve the data for Hotpep
        hotpep_temp_df = parse_hotpep_output(line)
        hotpep_df = pd.concat([hotpep_df, hotpep_temp_df], ignore_index=True)
        
        # retrieve the data for DIAMOND
        diamond_temp_df = parse_diamond_output(line)
        diamond_df = pd.concat([diamond_df, diamond_temp_df], ignore_index=True)
        
        # retrieve consensus results for dbCAN
        # only when #ofTools != 1: a prediction from a single tool is treated as a non-CAZyme prediction and not included
        if line[-1] != "1":
            fam_consensus_3, fam_consensus_2 = get_dbcan_consensus(
                hmmer_temp_df.iloc[0, 1],
                hotpep_temp_df.iloc[0, 1],
                diamond_temp_df.iloc[0, 1],
            )
            subfam_consensus_3, subfam_consensus_2 = get_dbcan_consensus(
                hmmer_temp_df.iloc[0, 2],
                hotpep_temp_df.iloc[0, 2],
                diamond_temp_df.iloc[0, 2],
            )

            consensus_dict = {
                "protein_accession": [line[0]],
                "cazy_family_3": [fam_consensus_3],
                "cazy_subfamily_3": [subfam_consensus_3],
                "cazy_family_2": [fam_consensus_2],
                "cazy_subfamily_2": [subfam_consensus_2],
            }
            
            temp_consensus_df = pd.DataFrame(consensus_dict)
            
            dbcan_df = pd.concat([dbcan_df, temp_consensus_df], ignore_index=True)

    return dbcan_df, hmmer_df, hotpep_df, diamond_df


def parse_hmmer_output(line):
    """Parse the output from HMMER from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families, accompanying subfamilies and domain ranges.
    Multiple domains can be predicted, and the families, subfamilies and ranges are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    # Retrieve predictions from HMMER (predicted domains and domain ranges)
    # separate out each of the domain predictions
    hmmer_prediction = line[1].split("+")
    
    # create empty lists to store predicted domains and ranges
    cazy_fam = []
    cazy_subfam = []
    domain_ranges = []

    # separate out the name of the predicted domain (the CAZy family) and the domain's AA range
    for domain in hmmer_prediction:
        domain = domain.split("(")
        # standardise name if necessary, retrieve subfam if predicted, and produce null values
        domain_name = domain[0]
        if domain_name.startswith("GT2_Chitin_"):
            cazy_fam.append("GT2")
            cazy_subfam.append(np.nan)

        elif domain_name == '-':
            cazy_fam.append(np.nan)
            cazy_subfam.append(np.nan)

        elif domain_name.find("_") != -1:
            cutoff = domain_name.find("_")
            cazy_fam.append(domain_name[:cutoff])
            cazy_subfam.append(domain_name)

        else:
            cazy_fam.append(domain_name)
            cazy_subfam.append(np.nan)

        try:
            if domain[1] == "-":
                domain_ranges.append(np.nan)
            else:
                domain_ranges.append(domain[1][:-1])  # exclude the final ")"
        except IndexError: # no domain predicted so no domain ranges
            domain_ranges.append(np.nan)

        
    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fam],
        "cazy_subfamily": [cazy_subfam],
        "domain_range": [domain_ranges],
    }

    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def parse_hotpep_output(line):
    """Parse the output from Hotpep from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families and accompanying subfamilies.
    Multiple domains can be predicted, and the families and subfamilies are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    # Retrieve predictions from Hotpep
    hotpep_prediction = line[2].split("+")

    # create empty lists to separate the predicted CAZy fam and subfam
    cazy_fams = []
    cazy_subfams = []

    # remove the k-mer cluster number
    for prediction in hotpep_prediction:
        prediction = prediction.split("(")[0]

        if prediction == "-":
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)

        elif prediction.find("_") != -1:
            cutoff = prediction.find("_")
            cazy_fams.append(prediction[:cutoff])
            cazy_subfams.append(prediction)

        else:
            cazy_fams.append(prediction)
            cazy_subfams.append(np.nan)

    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fams],
        "cazy_subfamily": [cazy_subfams],
    }
    
    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def parse_diamond_output(line):
    """Parse the output from DIAMOND from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families and accompanying subfamilies.
    Multiple domains can be predicted, and the families and subfamilies are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    
    # Retrieve domain predictions from DIAMOND
    diamond_prediction = line[3].split('+')

    # create empty lists to separate the CAZy family and subfamilies
    cazy_fams = []
    cazy_subfams = []

    for pred in diamond_prediction:
        if pred == '-':
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)

        elif pred.find("_") != -1:
            cutoff = pred.find("_")
            cazy_fams.append(pred[:cutoff])
            cazy_subfams.append(pred)

        else:
            cazy_fams.append(pred)
            cazy_subfams.append(np.nan)

    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fams],
        "cazy_subfamily": [cazy_subfams],
    }
    
    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def get_dbcan_consensus(hmmer, hotpep, diamond):
    """Get consensus results across HMMER, Hotpep and DIAMOND.
    
    Retrieves a list of items common to all three tools, and another list of items common to only
    two of the tools.
    
    :param hmmer: list of predictions from HMMER
    :param hotpep: list of predictions from Hotpep
    :param diamond: list of predictions from DIAMOND
    
    Return two lists: items on which all 3 tools agree, and items on which only 2 tools agree
    """
    # Retrieve list of items predicted by all three tools
    consensus_3 = list(set(hmmer) & set(hotpep) & set(diamond))
    if len(consensus_3) == 0:
        consensus_3 = [np.nan]
    
    # Retrieve list of items predicted by two of the tools
    consensus_2 = list(set(hmmer) & set(hotpep))
    consensus_2 += list(set(hmmer) & set(diamond))
    consensus_2 += list(set(hotpep) & set(diamond))
    
    # remove duplicates and items in the consensus for all 3 tools
    # (build a new list, because removing items from a list while iterating over it skips elements)
    consensus_2 = [item for item in dict.fromkeys(consensus_2) if item not in consensus_3]
    
    if len(consensus_2) == 0:
        consensus_2 = [np.nan]
    
    return consensus_3, consensus_2


dbcan_df, hmmer_df, hotpep_df, diamond_df = parse_dbcan_output("overview.txt")
dbcan_df

Unnamed: 0,protein_accession,cazy_family_3,cazy_subfamily_3,cazy_family_2,cazy_subfamily_2
0,XP_001545265.2,[GH53],[nan],[nan],[nan]
1,XP_001545514.1,[GT2],[nan],[nan],[nan]
2,XP_001545605.1,[AA4],[nan],[nan],[nan]
3,XP_001545700.1,[CBM21],[nan],[nan],[nan]
4,XP_001545706.2,[GH132],[nan],[nan],[nan]
...,...,...,...,...,...
504,XP_001557402.1,[nan],[nan],[GT41],[nan]
505,XP_024547082.1,[nan],[nan],[CBM1],[nan]
506,XP_024550885.1,[nan],[nan],[CBM18],[nan]
507,XP_001551566.1,[nan],[nan],[CBM18],[nan]


In [36]:
hmmer_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,domain_range
0,XP_001545265.2,[GH53],[nan],[20-346]
1,XP_001545514.1,[GT2],[nan],[1213-1720]
2,XP_001545605.1,[AA4],[nan],[28-577]
3,XP_001545700.1,[CBM21],[nan],[353-459]
4,XP_001545706.2,[GH132],[nan],[57-318]
...,...,...,...,...
701,XP_024553622.1,[nan],[nan],[nan]
702,XP_001546886.1,[nan],[nan],[nan]
703,XP_024553655.1,[nan],[nan],[nan]
704,XP_024553815.1,[nan],[nan],[nan]


In [37]:
hotpep_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
701,XP_024553622.1,[nan],[nan]
702,XP_001546886.1,[nan],[nan]
703,XP_024553655.1,[nan],[nan]
704,XP_024553815.1,[nan],[nan]


In [38]:
diamond_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
701,XP_024553622.1,[GH30],[GH30_7]
702,XP_001546886.1,[GH30],[GH30_7]
703,XP_024553655.1,"[CBM24, GH71]","[nan, nan]"
704,XP_024553815.1,[CBM13],[nan]


Although it seems appealing to separate the domains predicted by all 3 tools from those predicted by only 2, the dbCAN papers define a consensus result as one where at least 2 tools agree. Additionally, having single 'cazy_family' and 'cazy_subfamily' columns will make the statistical evaluation of performance much easier down the line, because all the dataframes will share the same column names.
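
As a rough illustration of why shared column names help, the sketch below stacks two toy dataframes into one table labelled by tool. The `dbcan_demo` and `diamond_demo` dataframes, the `P1` accession and the `tool` column name are assumptions made up for this example:

```python
import pandas as pd

# toy dataframes sharing the standard column names (hypothetical accession)
dbcan_demo = pd.DataFrame({
    "protein_accession": ["P1"],
    "cazy_family": [["GH5"]],
    "cazy_subfamily": [["GH5_1"]],
})
diamond_demo = pd.DataFrame({
    "protein_accession": ["P1"],
    "cazy_family": [["GH5", "CBM1"]],
    "cazy_subfamily": [["GH5_1", float("nan")]],
})

# stack the dataframes; the dict keys become an index level named "tool"
combined = pd.concat({"dbCAN": dbcan_demo, "DIAMOND": diamond_demo}, names=["tool", "row"])

# turn the "tool" level into an ordinary column and discard the old row numbers
combined = combined.reset_index(level="tool").reset_index(drop=True)
print(combined)
```

With differently named columns per tool, the same stacking would scatter the predictions across separate, mostly empty columns.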

Therefore, here is the final version of the parse_dbcan_output function group, where the dbCAN consensus result is defined as any domains that have been predicted by at least 2 tools within dbCAN.


In [39]:
def parse_dbcan_output(overview_file_path):
    """Parse the output from the run_dbCAN overview.txt file, writing a dataframe for each prediction tool.
    
    :param overview_file_path: path, path to the output 'overview.txt' file
    
    Return 4 dataframes, one each for consensus dbCAN result, HMMER, Hotpep and DIAMOND
    """
    with open(overview_file_path, "r") as fh:
        overview_file = fh.read().splitlines()
        
    # build empty dataframes to add the respective prediction tools' output to
    hmmer_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily", "domain_range"])
    hotpep_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily"]) # add ec later
    diamond_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily"])
    dbcan_df = pd.DataFrame({}, columns=["protein_accession", "cazy_family", "cazy_subfamily"])

    for line in overview_file[1:]: # skip the first line because it contains the column titles
        line = line.split("\t")
        
        # retrieve the data for HMMER
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        hmmer_temp_df = parse_hmmer_output(line)
        hmmer_df = pd.concat([hmmer_df, hmmer_temp_df], ignore_index=True)
        
        # retrieve the data for Hotpep
        hotpep_temp_df = parse_hotpep_output(line)
        hotpep_df = pd.concat([hotpep_df, hotpep_temp_df], ignore_index=True)
        
        # retrieve the data for DIAMOND
        diamond_temp_df = parse_diamond_output(line)
        diamond_df = pd.concat([diamond_df, diamond_temp_df], ignore_index=True)
        
        # retrieve consensus results for dbCAN
        # check #ofTools: a prediction from a single tool is treated as a non-CAZyme prediction and not included
        if line[-1] != "1":
            consensus_cazy_fam = get_dbcan_consensus(
                hmmer_temp_df.iloc[0, 1],
                hotpep_temp_df.iloc[0, 1],
                diamond_temp_df.iloc[0, 1],
            )
            consensus_sub_fam = get_dbcan_consensus(
                hmmer_temp_df.iloc[0, 2],
                hotpep_temp_df.iloc[0, 2],
                diamond_temp_df.iloc[0, 2],
            )

            consensus_dict = {
                "protein_accession": [line[0]],
                "cazy_family": [consensus_cazy_fam],
                "cazy_subfamily": [consensus_sub_fam],
            }
            
            temp_consensus_df = pd.DataFrame(consensus_dict)
            
            dbcan_df = pd.concat([dbcan_df, temp_consensus_df], ignore_index=True)

    return dbcan_df, hmmer_df, hotpep_df, diamond_df


def parse_hmmer_output(line):
    """Parse the output from HMMER from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families, accompanying subfamilies and domain ranges.
    Multiple domains can be predicted, and the families, subfamilies and ranges are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    # Retrieve predictions from HMMER (predicted domains and domain ranges)
    # separate out each of the domain predictions
    hmmer_prediction = line[1].split("+")
    
    # create empty lists to store predicted domains and ranges
    cazy_fam = []
    cazy_subfam = []
    domain_ranges = []

    # separate out the name of the predicted domain (the CAZy family) and the domain's AA range
    for domain in hmmer_prediction:
        domain = domain.split("(")
        # standardise name if necessary, retrieve subfam if predicted, and produce null values
        domain_name = domain[0]
        if domain_name.startswith("GT2_Chitin_"):
            cazy_fam.append("GT2")
            cazy_subfam.append(np.nan)

        elif domain_name == '-':
            cazy_fam.append(np.nan)
            cazy_subfam.append(np.nan)

        elif domain_name.find("_") != -1:
            cutoff = domain_name.find("_")
            cazy_fam.append(domain_name[:cutoff])
            cazy_subfam.append(domain_name)

        else:
            cazy_fam.append(domain_name)
            cazy_subfam.append(np.nan)

        try:
            if domain[1] == "-":
                domain_ranges.append(np.nan)
            else:
                domain_ranges.append(domain[1][:-1])  # exclude the final ")"
        except IndexError: # no domain predicted so no domain ranges
            domain_ranges.append(np.nan)

        
    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fam],
        "cazy_subfamily": [cazy_subfam],
        "domain_range": [domain_ranges],
    }

    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def parse_hotpep_output(line):
    """Parse the output from Hotpep from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families and accompanying subfamilies.
    Multiple domains can be predicted, and the families and subfamilies are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    # Retrieve predictions from Hotpep
    hotpep_prediction = line[2].split("+")

    # create empty lists to separate the predicted CAZy fam and subfam
    cazy_fams = []
    cazy_subfams = []

    # remove the k-mer cluster number
    for prediction in hotpep_prediction:
        prediction = prediction.split("(")[0]

        if prediction == "-":
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)

        elif prediction.find("_") != -1:
            cutoff = prediction.find("_")
            cazy_fams.append(prediction[:cutoff])
            cazy_subfams.append(prediction)

        else:
            cazy_fams.append(prediction)
            cazy_subfams.append(np.nan)

    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fams],
        "cazy_subfamily": [cazy_subfams],
    }
    
    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def parse_diamond_output(line):
    """Parse the output from DIAMOND from the overview.txt file.
    
    Retrieve the protein accession, predicted CAZy families and accompanying subfamilies.
    Multiple domains can be predicted, and the families and subfamilies are collected as
    lists, where items at the same index in each list describe the same domain. For example, the
    first item in every list holds the prediction for the first predicted domain, the second item
    the second prediction, etc.
    
    :param line: list of str, tab-separated fields of one line from the dbCAN overview.txt file
    
    Return pandas dataframe."""
    
    # Retrieve domain predictions from DIAMOND
    diamond_prediction = line[3].split('+')

    # create empty lists to separate the CAZy family and subfamilies
    cazy_fams = []
    cazy_subfams = []

    for pred in diamond_prediction:
        if pred == '-':
            cazy_fams.append(np.nan)
            cazy_subfams.append(np.nan)

        elif pred.find("_") != -1:
            cutoff = pred.find("_")
            cazy_fams.append(pred[:cutoff])
            cazy_subfams.append(pred)

        else:
            cazy_fams.append(pred)
            cazy_subfams.append(np.nan)

    prediction_dict = {
        "protein_accession": [line[0]],
        "cazy_family": [cazy_fams],
        "cazy_subfamily": [cazy_subfams],
    }
    
    prediction_df = pd.DataFrame(prediction_dict)
    
    return prediction_df


def get_dbcan_consensus(hmmer, hotpep, diamond):
    """Get consensus results across HMMER, Hotpep and DIAMOND.
    
    A consensus result is defined as a result that at least two of the tools in dbCAN have predicted.
    Stores the consensus results in a list, with no duplicates.
    
    :param hmmer: list of predictions from HMMER
    :param hotpep: list of predictions from Hotpep
    :param diamond: list of predictions from DIAMOND
    
    Returns a list.
    """
    # Retrieve list of items predicted by all three tools
    consensus = list(set(hmmer) & set(hotpep) & set(diamond))
    
    # add domains predicted by two of the tools
    consensus += list(set(hmmer) & set(hotpep))
    consensus += list(set(hmmer) & set(diamond))
    consensus += list(set(hotpep) & set(diamond))
    
    # remove duplicates
    consensus = list(dict.fromkeys(consensus))
    
    if len(consensus) == 0:
        consensus = [np.nan]
    
    return consensus


dbcan_df, hmmer_df, hotpep_df, diamond_df = parse_dbcan_output("overview.txt")
dbcan_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
504,XP_001557402.1,[GT41],[nan]
505,XP_024547082.1,[CBM1],[nan]
506,XP_024550885.1,[CBM18],[nan]
507,XP_001551566.1,[CBM18],[nan]


In [40]:
hmmer_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily,domain_range
0,XP_001545265.2,[GH53],[nan],[20-346]
1,XP_001545514.1,[GT2],[nan],[1213-1720]
2,XP_001545605.1,[AA4],[nan],[28-577]
3,XP_001545700.1,[CBM21],[nan],[353-459]
4,XP_001545706.2,[GH132],[nan],[57-318]
...,...,...,...,...
701,XP_024553622.1,[nan],[nan],[nan]
702,XP_001546886.1,[nan],[nan],[nan]
703,XP_024553655.1,[nan],[nan],[nan]
704,XP_024553815.1,[nan],[nan],[nan]


In [41]:
hotpep_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
701,XP_024553622.1,[nan],[nan]
702,XP_001546886.1,[nan],[nan]
703,XP_024553655.1,[nan],[nan]
704,XP_024553815.1,[nan],[nan]


In [42]:
diamond_df

Unnamed: 0,protein_accession,cazy_family,cazy_subfamily
0,XP_001545265.2,[GH53],[nan]
1,XP_001545514.1,[GT2],[nan]
2,XP_001545605.1,[AA4],[nan]
3,XP_001545700.1,[CBM21],[nan]
4,XP_001545706.2,[GH132],[nan]
...,...,...,...
701,XP_024553622.1,[GH30],[GH30_7]
702,XP_001546886.1,[GH30],[GH30_7]
703,XP_024553655.1,"[CBM24, GH71]","[nan, nan]"
704,XP_024553815.1,[CBM13],[nan]
