# Parsing CUPP output

The CAZyme prediction tools dbCAN, CUPP and eCAMI write their output files in different formats and do not present the data in a standardised manner. Additionally, the output from the prediction tools cannot be parsed easily for use with the [Sklearn](https://scikit-learn.org/stable/index.html) library, which performs the statistical analysis of the tools' performances.

This notebook explains the method used by the `pyrewton.cazymes.prediction.parse` submodule to parse the output from the CAZyme prediction tool CUPP.

[CUPP](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6489277/) is a k-mer based CAZyme prediction tool. The information defining the data included in each field of the output files came from the [README file](https://files.dtu.dk/userportal/#/shared/public/hLin6ni4p-SWuKfp/CUPP_program/) and the raw Python script.

The following data is collected from the CUPP output:
- protein_accession
- cazy_family (all predicted CAZy families)
- cazy_subfamily (all predicted CAZy subfamilies)
- ec_number (all predicted EC numbers)
- domain_range (the amino acid domain range of all predicted CAZy (sub)families)


### Notebook structure

Some of the information about CUPP and its data is taken from its README, but most of it is taken from the `CUPPprediction.py` Python script.

The section titled 'exploring the output' shows the exploration of the tool's output files. This demonstrates how the data is constructed and presented, and forms a foundation for explaining the developed method of parsing the data.

The section 'parsing the output' shows the method of parsing the data from the CAZyme prediction tool.

The final section shows any additional parsing that was performed to make the statistical analysis of the tool's performance easier. This typically includes removing duplicates, and combining rows that represent data for the same predicted CAZyme.


## Contents

- [Imports](#notebook_imports)
- [Exploring the CUPP FASTA file](#cupp_fasta)
- [Exploring the CUPP log file](#cupp_log)
- [Choosing the CUPP output file to parse](#cupp_choice)
- [Parsing the CUPP log file](#cupp_parse_log)
    - [Defining the data model](#data_model)
    - [Getting the predicted domain range](#cupp_domain_range)
    - [Getting the predicted EC number](#cupp_ec)
        - [Adding quality checks when getting the EC number](#cupp_ec_quality)
    - [Getting the predicted subfamily](#cupp_subfam)
        - [Standardising CAZy subfamilies (PL7:2)](#cupp_std_subfam)
- [Combining functions and populating the data model](#cupp_functions)

<a id="notebook_imports"></a>

In [1]:
# Install necessary libraries
!pip3 install pandas
!pip3 install numpy

import re

import pandas as pd
import numpy as np

from tqdm.notebook import tqdm



<a id="cupp_intro"></a>

## Understanding the output

CUPP produces two output files (a log file and a FASTA file) that contain the same data, but present it slightly differently. 

#### FASTA file
The format of the data in the output FASTA file for each protein is as follows:  
```
> accession | raw_function | best_function | additional_info | subfam \n seq
```

For example:

```
>QIX00022.1 AMS68_005539|&|AA1:1.10.3.2|AA1:1.10.3.2|AA1:6.1(20.3,124..357)|AA1_3
mvrsvnfplgvyfqykfikigtdttltweadpnreytvpsdcntspvlsgpwqyppvstsaavaiaatsttaqasavvtsqptycpntpttrncwadgfdintdfdnhwpttgrtvsytfeitnttlApDgvSRPvfaingqyPGPtIYADWgDMISVtvinklqdngttIHFHGVrQWHSNtqdgvpGLTECpIAPGTTRTyTfqatQfGTSwYHSHFSCQYGdGILGPiviqGpAtANYDIDVGpLpitdwyyptvnyhasitehtngsppvadnalINgTMTSSsggayaltklttgkrhRLrLINTSVDNHFMVSLdghqievitaDFvpIVPYNtTWimlgigqRYDVIITADQAPgnywfraevqdgcganannkniksmfaypghenetpistgsgytgrctdetqlvphwnsyvppkagfaeptffgtalnqstaadgtltlywqvngsslniqwdkptlqyvqesntsypsganvvetategqwtywviqevpgtpynvnvphplhLHGHDFFilGtgtdtwdakfsqnlnynnpprrdtamlpsngwivlawqtdnpgawLFhCHiAwhsdeglgvqfleaadqmqsvnpigtefdqqcqawhdfypvhasylkhdsgv
```

The following defines and explains what data is included within each field of the output FASTA file.

**Accession**
This is the data contained in the line starting with '>' in the input FASTA file containing the query sequences for CUPP. If the input FASTA file passed to CUPP was written using `pyrewton`, the first item in the accession is the GenBank protein accession number. This is the unique identifier for each protein.

**Raw function** The following definition was taken from the raw Python script for `CUPP.CUPPprediction.py`: The 'raw function' is produced "when consider[ing] all EC functions of a CUPP equally". This is presented as "predicted_cazy_family : predicted_ec_number".

**Best function** The following definition was taken from the raw Python script for `CUPP.CUPPprediction.py`: the best function is the "most occuring EC function of the CUPP group". It was inferred from the Python script `CUPP.CUPPprediction.py` that the 'CUPP group' refers to the CAZy family k-mer cluster for which the query protein scored most highly. In the FASTA file this is presented as "predicted_cazy_family : predicted_ec_number". **This is the function that is taken from the output files as the CUPP CAZyme and CAZy family prediction.**

**Additional information** This includes several pieces of information that are calculated by CUPP. Owing to the manner that CUPP presents this data in the FASTA file it can be difficult to determine exactly which pieces of data have been provided. The additional data is presented in the following order (and the definitions were taken from the script `CUPP.CUPPprediction.py`):
- "Ratio of maximum frequencies"
- "Ratio of maximum frequencies by coverage"
- "Centre for mass for domain"
- "Domain range" -- this is identifiable by the presence of '..' between two numbers and could potentially be useful when looking for regions of positive selection down the road
- "Number of important positions in domain" -- the meaning of this is very ambiguous

**Subfam** This is the predicted CAZy subfamily for the query protein

**Seq** This is the amino acid sequence of the query protein. It has not been determined why some letters are upper case and others lower case, although it is hypothesised that the upper case letters indicate the start of a k-mer hit found by CUPP.
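One way to explore the upper-case hypothesis is to list where the upper-case stretches sit in a sequence. The snippet below is only a minimal sketch: the helper name `uppercase_runs` and the short test sequence are invented for illustration, and it does not confirm what CUPP actually encodes with letter case.

```python
import re


def uppercase_runs(sequence):
    """Return (start, end, residues) for each run of upper-case letters.

    Positions are 1-based and inclusive.
    """
    return [
        (match.start() + 1, match.end(), match.group())
        for match in re.finditer(r"[A-Z]+", sequence)
    ]


# short made-up sequence, not real CUPP output
for start, end, residues in uppercase_runs("mvrsGYGgAAPRvp"):
    print(start, end, residues)
```

Comparing these positions against the reported domain range for the same protein would be one way to check whether the upper-case residues fall inside the predicted domain.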

#### Log file
The output log file for CUPP presents the same data as the FASTA, but presents the data in a different way. Instead of separating the data by '|' (which is used in the output FASTA file), the data is separated by tabulated white space ('\t').

Additionally, the data is presented in a different order to the output FASTA file:
- Predicted CAZy family (equivalent to the FASTA file best function)
- CUPP (k-mer cluster) group which the query protein scored highest for, and the associated score
- Additional information
- Raw_function (same as the fasta file (presented as "cazy_fam:ec#"))
- Best_function (same as the fasta file (presented as "cazy_fam:ec#")) **This is the function that is taken from the output files as the CUPP CAZyme and CAZy family prediction.**
- Subfam
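The field order listed above can be sketched as a small helper that names the fields of one log line. This is an illustration only: the function name `split_cupp_log_line` and the dictionary keys are assumptions, and only the positions described above are relied on (accession and family at the start; raw function, best function and subfamily at the end). Everything in between is kept together as a list rather than named individually.

```python
def split_cupp_log_line(line):
    """Split one tab-separated CUPP log line into named fields (hypothetical helper)."""
    fields = line.split("\t")
    return {
        "accession": fields[0],
        "cazy_family": fields[1],
        # CUPP group, its score and the additional information, left unnamed
        "cupp_group_and_additional_info": fields[2:-3],
        "raw_function": fields[-3],
        "best_function": fields[-2],
        "subfam": fields[-1],
    }


# line copied from the CUPP log file used in this notebook
example = (
    "XP_001547254.1 BCIN_01g00800\tAA1\tAA1:6.1\t10.5\t5.13\t329\t"
    "202..455\t98\tAA1:1.10.3.2\tAA1:1.10.3.2\tAA1_3"
)
record = split_cupp_log_line(example)
print(record["best_function"], record["subfam"])
```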


<a id="cupp_fasta"></a>

### Exploring the CUPP FASTA file

The data in the CUPP FASTA file is separated by '|'.
Each protein covers two lines of data: a header line starting with '>' and a sequence line.

In [2]:
# Reading the CUPP output FASTA file
with open("gca_143535_cupp.fasta", "r") as fh:
    fasta_file = fh.read().splitlines()
for line in fasta_file[:4]:
    print(line)

>XP_001545605.1 BCIN_01g00260|&|AA4:1.1.3.38|AA4:1.1.3.38|AA4:1.1(13.9,122..568)|
maasivttdtplalrvpnaattpaandplalppgtssekfaefishaeqttgaenviiirkkeeltkelytdpskahdmyhlfdkdyfvvsatiaprnveevqaivrlcndyeiplwpfsigrnvGYGgAAPRVPGSigldmgknmnkvlevnsegayallEPGvTFFSLydylkekklddqlwidtpdlgggsvvGNTIERGvGYTPYGDHFMMHCGLeVVlPTGELIRTGMGALPDPTQptveggsldqqpgnkcwqlfpygfgpYnDGIfSQSNlGIVTKMGIWLMpNPGGyQpylitfpkdtdlpkivdiirplrlqmviqnvpsirhilldaavmgtkksyydvdrplneeeldaiakelnlgrwnfygAlYgPKPvrdvlwqvvkdafgtiegakfflpedikekcvlhiraktlqgiptidelswvdwlpngahlffspiskisgedaskqyaltqkmtleagfdfigtftigmremhhiiclvfdrededqkrrahklireliqvcadngwgeYrTHIALMdqiaetygwndnaqmklnekiknaLDpKGILaPGKNGVWpAsYdrsawrldansertrtrp
>XP_024545922.1 BCIN_01g00340|&|GT2:2.4.1.117|GT2:2.4.1.117|GT2:23.1(20.2,116..300)|
mdgllelpvqilgglwevvrdtpvhvlgviligllglglgliyallllvapvprppypsektyitttpsgttetkpltcwydswvahreasvsksdscpantirtgaiepatlemslvvPaYNEEERLIgMLeealsfldttygrvargtgtgtgyeillvNDgSRDRTVEIAlDFSrknglhdvlrivtleenRGkGGAVTHGMRHVRGEYAVfADADGASRFADLGkLvkgvrevvdeeg

In [3]:
# read the raw text in the fasta file
for line in fasta_file[:4]:
    print(repr(line))
# the data is separated by '|'

'>XP_001545605.1 BCIN_01g00260|&|AA4:1.1.3.38|AA4:1.1.3.38|AA4:1.1(13.9,122..568)|'
'maasivttdtplalrvpnaattpaandplalppgtssekfaefishaeqttgaenviiirkkeeltkelytdpskahdmyhlfdkdyfvvsatiaprnveevqaivrlcndyeiplwpfsigrnvGYGgAAPRVPGSigldmgknmnkvlevnsegayallEPGvTFFSLydylkekklddqlwidtpdlgggsvvGNTIERGvGYTPYGDHFMMHCGLeVVlPTGELIRTGMGALPDPTQptveggsldqqpgnkcwqlfpygfgpYnDGIfSQSNlGIVTKMGIWLMpNPGGyQpylitfpkdtdlpkivdiirplrlqmviqnvpsirhilldaavmgtkksyydvdrplneeeldaiakelnlgrwnfygAlYgPKPvrdvlwqvvkdafgtiegakfflpedikekcvlhiraktlqgiptidelswvdwlpngahlffspiskisgedaskqyaltqkmtleagfdfigtftigmremhhiiclvfdrededqkrrahklireliqvcadngwgeYrTHIALMdqiaetygwndnaqmklnekiknaLDpKGILaPGKNGVWpAsYdrsawrldansertrtrp'
'>XP_024545922.1 BCIN_01g00340|&|GT2:2.4.1.117|GT2:2.4.1.117|GT2:23.1(20.2,116..300)|'
'mdgllelpvqilgglwevvrdtpvhvlgviligllglglgliyallllvapvprppypsektyitttpsgttetkpltcwydswvahreasvsksdscpantirtgaiepatlemslvvPaYNEEERLIgMLeealsfldttygrvargtgtgtgyeillvNDgSRDRTVEIAlDFSrknglhdvlrivtleenRGkGGAVTHGMRHVRGEYAVfADADGASRFADLGkLvkgvr

<a id="cupp_log"></a>

### Exploring the CUPP Log file

The data in the CUPP log file is separated by '\t' (tabulated white space). Each protein is contained on a single line.

In [4]:
# read the log file
with open("gca_143535_cupp.fasta.log", "r") as lfh:
    log_file = lfh.read().splitlines()

for line in log_file[:4]:
    print(line)

XP_001545605.1 BCIN_01g00260	AA4	AA4:1.1	13.9	7.39	268	122..568	107	AA4:1.1.3.38	AA4:1.1.3.38	
XP_024545922.1 BCIN_01g00340	GT2	GT2:23.1	20.2	7.74	211	116..300	77	GT2:2.4.1.117	GT2:2.4.1.117	
XP_001547235.1 BCIN_01g00640	AA3	AA3:3.1	64.0	144.16	274	26..553	453			AA3_1
XP_001547254.1 BCIN_01g00800	AA1	AA1:6.1	10.5	5.13	329	202..455	98	AA1:1.10.3.2	AA1:1.10.3.2	AA1_3


In [5]:
# read the raw text in the log file
for line in log_file[:4]:
    print(repr(line))
# data is separated by '\t'

'XP_001545605.1 BCIN_01g00260\tAA4\tAA4:1.1\t13.9\t7.39\t268\t122..568\t107\tAA4:1.1.3.38\tAA4:1.1.3.38\t'
'XP_024545922.1 BCIN_01g00340\tGT2\tGT2:23.1\t20.2\t7.74\t211\t116..300\t77\tGT2:2.4.1.117\tGT2:2.4.1.117\t'
'XP_001547235.1 BCIN_01g00640\tAA3\tAA3:3.1\t64.0\t144.16\t274\t26..553\t453\t\t\tAA3_1'
'XP_001547254.1 BCIN_01g00800\tAA1\tAA1:6.1\t10.5\t5.13\t329\t202..455\t98\tAA1:1.10.3.2\tAA1:1.10.3.2\tAA1_3'


<a id="cupp_choice"></a>

### Choosing the CUPP output file to parse

The output FASTA and log files of CUPP contain the same data, therefore, it is unnecessary to parse both output files.

A parser of the CUPP FASTA file would separate the protein data by the '|' separator character. This would parse the individual items of the protein data into a list, in the same order as they appear in the output FASTA file. This would allow the parser to selectively retrieve the predicted CAZy families, EC numbers, etc.

However, '|' is a common data separator. A user, or the external database from which the input FASTA file of query protein sequences was retrieved, may include '|' characters in the FASTA headers. For example, a protein data line may be written as ">XP_001545605.1|BCIN_01g00260|Botrytis cinerea B05.10|XM_001545555.2" in the CUPP input FASTA file. The entire contents of this protein data line would be included in the CUPP output FASTA file under the 'Accession' field. If the 'Accession' field contains '|', the items within the 'Accession' field would also be separated, which would disrupt the selective retrieval of data.

The code blocks below demonstrate this theory.

In [6]:
# Example that would work
protein_data_line = '>XP_024545922.1 BCIN_01g00340|&|GT2:2.4.1.117|GT2:2.4.1.117|GT2:23.1(20.2,116..300)|'
protein_data = protein_data_line.split("|")
protein_data

['>XP_024545922.1 BCIN_01g00340',
 '&',
 'GT2:2.4.1.117',
 'GT2:2.4.1.117',
 'GT2:23.1(20.2,116..300)',
 '']

In [7]:
print("best function=", protein_data[3])

best function= GT2:2.4.1.117


In [8]:
# Example that would cause the incorrect data to be retrieved
protein_data_line = '>XP_001545605.1|BCIN_01g00260|Botrytis cinerea B05.10|XM_001545555.2|&|GT2:2.4.1.117|GT2:2.4.1.117|GT2:23.1(20.2,116..300)|'
protein_data = protein_data_line.split("|")
protein_data

['>XP_001545605.1',
 'BCIN_01g00260',
 'Botrytis cinerea B05.10',
 'XM_001545555.2',
 '&',
 'GT2:2.4.1.117',
 'GT2:2.4.1.117',
 'GT2:23.1(20.2,116..300)',
 '']

In [9]:
print("best function=", protein_data[3])

best function= XM_001545555.2


Using '|' as a separator is common. Therefore, using '|' to identify separations between individual items of data may be a brittle approach when parsing the output from CUPP.

The data in the output log file is separated by tabs ('\t'). This is less common than '|', so the log file was deemed the less brittle of the two files to parse. Consequently, a method to parse the CUPP output log file was developed.
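The same check can be sketched for the log file. The line below is hypothetical: the accession is invented to contain '|' characters (assuming CUPP copies the header into the log file unchanged), while the remaining fields are copied from a real line of the log file. Splitting on '\t' leaves the accession intact and the best function in its expected position.

```python
# Hypothetical log line: only the accession is invented
log_line = (
    "XP_001545605.1|BCIN_01g00260|Botrytis cinerea B05.10|XM_001545555.2"
    "\tGT2\tGT2:23.1\t20.2\t7.74\t211\t116..300\t77\tGT2:2.4.1.117\tGT2:2.4.1.117\t"
)

fields = log_line.split("\t")
print("accession =", fields[0])
print("best function =", fields[-2])
```

A tab inside a FASTA header would still break this approach, so it is less brittle rather than fully robust.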

<a id="cupp_parse_log"></a>

## Parsing the CUPP log file

The first task is to separate the data fields created by CUPP into separate elements in a list.

<a id="data_model"></a>

### Defining the data model

The design of the data model is explained in the notebook prediction_data_model.ipynb.


In [10]:
class CazymeDomain:
    """Single CAZyme domain in a protein, predicted by a CAZyme prediction tool.

    Each unique CAZy domain per protein is identifiable by a unique CAZy
    family-subfamily combination.

    Every CAZyme domain has a source CAZyme prediction tool that predicted the CAZyme
    domain, a parent CAZyme protein (represented by the protein accession), and CAZy
    family and subfamily combination. If no CAZy subfamily is predicted, the 
    subfamily will be listed as a null value.

    Hotpep, CUPP and eCAMI predict EC numbers for each CAZyme domain.
    HMMER and CUPP predict amino acid domain ranges.
    Multiple EC numbers and domain ranges can be predicted for a single CAZyme domain,
    therefore, these attributes are stored as lists.
    """

    def __init__(
        self,
        prediction_tool,
        protein_accession,
        cazy_family,
        cazy_subfamily=None,
        ec_numbers=None,
        domain_range=None,
    ):
        """Initiate instance
        
        :attr prediction_tool: str, CAZyme prediction tool which predicted the domain
        :attr protein_accession: str
        :attr cazy_family: str
        :attr cazy_subfamily: str
        :attr ec_numbers: list (list of str, each str contains a unique EC number)
        :attr domain_range: list (list of str, each str contains a unique domain range)
        """
        self.prediction_tool = prediction_tool
        self.protein_accession = protein_accession
        self.cazy_family = cazy_family
        
        # not all CAZyme domains are catalogued under a CAZy subfamily
        if cazy_subfamily is None:
            self.cazy_subfamily = np.nan
        else:
            self.cazy_subfamily = cazy_subfamily
        
        # EC numbers are not predicted for all CAZyme domains
        if ec_numbers is None:
            self.ec_numbers = []  # enables adding in EC numbers included in another line of the output file
        else:
            self.ec_numbers = ec_numbers
        
        # Not all prediction tools predict domain ranges
        if domain_range is None:
            self.domain_range = []  # enables adding domain range listed in another line of the output file
        else:
            self.domain_range = domain_range
    
    def __str__(self):
        return f"-CazymeDomain in {self.protein_accession}, fam={self.cazy_family}, subfam={self.cazy_subfamily}-"
    
    def __repr__(self):
        return f"<CazymeDomain parent={self.protein_accession} fam={self.cazy_family}, subfam={self.cazy_subfamily}>"

    
class CazymeProteinPrediction:
    """Single protein and CAZyme/non-CAZyme prediction by a CAZyme prediction tool"""
    
    def __init__(self, prediction_tool, protein_accession, cazyme_classification, cazyme_domains=None):
        """Initiate class instance.
        
        :attr prediction_tool: str, name of CAZyme prediction tool
        :attr protein_accession: str
        :attr cazyme_classification: int, 1=CAZyme, 0=non-CAZyme
        :attr cazyme_domains: list of CazymeDomain instances, domain predicted to be in the CAZyme
        """
        self.prediction_tool = prediction_tool
        self.protein_accession = protein_accession
        self.cazyme_classification = cazyme_classification  # CAZyme=1, non-CAZyme=0
        
        # non-CAZymes will have no cazyme_domains
        if cazyme_domains is None:
            self.cazyme_domains = []  # enables adding domain predictions too
        else:
            self.cazyme_domains = cazyme_domains
    
    def __str__(self):
        if self.cazyme_classification == 0:
            return f"-CazymeProteinPrediction, protein={self.protein_accession}, non-CAZyme-"
        else:
            return f"-CazymeProteinPrediction, protein={self.protein_accession}, CAZyme domains={len(self.cazyme_domains)}-"
    
    def __repr__(self):
        return(
            f"<CazymeProteinPrediction, protein={self.protein_accession}, "
            f"cazyme_classification={self.cazyme_classification}>"
        )


Now let's split up the data in the CUPP log file to prepare it for adding to the data model. Additionally, we create a dictionary, keyed by the protein accession and valued by its corresponding `CazymeProteinPrediction` instance. This enables quick checking of whether an instance has already been created for a protein.

In [11]:
cupp_predictions = {}  # stores proteins {protein_accession: CazymeProteinPrediction instance}

# first split the line into the individual pieces of data
for line in log_file[:2]:
    prediction_output = line.split("\t")
    print(prediction_output)

['XP_001545605.1 BCIN_01g00260', 'AA4', 'AA4:1.1', '13.9', '7.39', '268', '122..568', '107', 'AA4:1.1.3.38', 'AA4:1.1.3.38', '']
['XP_024545922.1 BCIN_01g00340', 'GT2', 'GT2:23.1', '20.2', '7.74', '211', '116..300', '77', 'GT2:2.4.1.117', 'GT2:2.4.1.117', '']
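The purpose of the accession-keyed dictionary can be illustrated with a minimal sketch. The `ProteinPrediction` dataclass is a hypothetical, simplified stand-in for `CazymeProteinPrediction` (so the snippet runs on its own), and the accessions and families are invented. The point is the membership check: an instance is created only the first time an accession is seen, and later lines for the same protein add to the existing instance.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinPrediction:
    """Hypothetical, simplified stand-in for CazymeProteinPrediction."""
    protein_accession: str
    cazyme_domains: list = field(default_factory=list)


# invented (accession, CAZy family) pairs; the first protein appears twice
parsed_lines = [
    ("protein_A", "GH16"),
    ("protein_B", "AA1"),
    ("protein_A", "GT2"),
]

predictions = {}  # {protein_accession: ProteinPrediction}
for accession, family in parsed_lines:
    if accession not in predictions:  # membership check on the dictionary keys
        predictions[accession] = ProteinPrediction(accession)
    predictions[accession].cazyme_domains.append(family)

for accession, prediction in predictions.items():
    print(accession, prediction.cazyme_domains)
```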


<a id="cupp_domain_range"></a>

### Getting the predicted domain range

The domain range is stored within the 'additional data' field by CUPP. However, if CUPP does not compute some of the fields within 'additional data', an empty object (such as an empty string) is _not_ written out to the output FASTA or log files.

Reminder of the data included in the 'additional data' field by CUPP:
- "Ratio of maximum frequencies"
- "Ratio of maximum frequencies by coverage"
- "Centre for mass for domain"
- "Domain range" -- this is identifiable by the presence of '..' between two numbers and could potentially be useful when looking for regions of positive selection down the road
- "Number of important positions in domain" -- the meaning of this is very ambiguous

Therefore, if only the domain range is calculated, only the data for the domain range will be included, producing the following output:
`XP_001545605.1 BCIN_01g00260  AA4  AA4:1.1 15..250`
If the 'Ratio of maximum frequencies' and domain range are calculated, then two elements of data will be included in the space allocated for 'additional data' within the output FASTA and log files. Also, these elements within the 'additional data' are separated by the same character used to separate the other data fields in the output file (such as '\t' for the output log file). Therefore, the following output would be produced:
`XP_001545605.1 BCIN_01g00260  AA4  AA4:1.1  7.4  15..250` 

Therefore, the predicted domain range cannot be retrieved by calling for an element in the `prediction_output` list by its index. However, the domain range is easy to identify by its characteristic '..' between two digits. The remaining data can be retrieved based upon its index in the `prediction_output` list.

In [12]:
for line in log_file[:4]:
    # separate the data fields
    prediction_output = line.split("\t")
    
    for item in prediction_output:
        if item.find("..") != -1:
            domain_range = item
            break
        else:
            domain_range = np.nan
    
    # store the prediction output in a dictionary
    prediction = {
        "protein_accession": prediction_output[0],
        "cazy_family": prediction_output[1],
        "cazy_subfamily": prediction_output[-1],
        "ec_number": prediction_output[-2],
        "domain_range": domain_range,
    }

print("protein_accession =", prediction["protein_accession"])
print("CAZy family =", prediction["cazy_family"])
print("CAZy subfamily =", prediction["cazy_subfamily"])
print("EC# =", prediction["ec_number"])
print("Domain range =", prediction["domain_range"])

protein_accession = XP_001547254.1 BCIN_01g00800
CAZy family = AA1
CAZy subfamily = AA1_3
EC# = AA1:1.10.3.2
Domain range = 202..455


This can be written as a single function, with a 'quality' control added to check that the believed domain range is in the expected format (_digit_.._digit_), increasing the probability of correctly identifying the domain range.

In [13]:
def get_cupp_domain_range(prediction_output):
    """Retrieve the predicted amino acid range of predicted CAZyme domain from CUPP log file.
    
    :param prediction_output: list of items from log file line.
    
    Return string if domain range given, or null value if a domain range is not given.
    """
    domain_range = []  # store as a list in case multiple domain ranges are given

    for item in prediction_output:
        if item.find("..") == -1:
            continue
        # check the retrieved item is definitely a domain range (digit..digit)
        if re.match(r"\d+\.\.\d+", item) is None:
            # write as logger in pyrewton
            print(f"{item} misidentified as domain range")
            continue
        domain_range.append(item)

    if len(domain_range) == 0:
        domain_range = np.nan
    
    else:
        domain_range = ", ".join(domain_range)
    
    return domain_range


for line in log_file[:4]:
    # separate the data fields
    prediction_output = line.split("\t")
    domain_range = get_cupp_domain_range(prediction_output)
    print(domain_range)


122..568
116..300
26..553
202..455
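If the start and end positions are needed as numbers (for example to compute the domain length), the 'start..end' string can be converted as sketched below. The helper `domain_range_to_positions` is hypothetical and not part of `pyrewton`; it assumes a single range in the _digit_.._digit_ format shown above.

```python
import re


def domain_range_to_positions(domain_range):
    """Convert a 'start..end' string into a (start, end) tuple of integers.

    Returns None if the string is not in the expected digit..digit format.
    """
    match = re.fullmatch(r"(\d+)\.\.(\d+)", domain_range)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


start, end = domain_range_to_positions("122..568")
print("start =", start, "end =", end, "length =", end - start + 1)
```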


<a id="cupp_ec"></a>

### Getting the predicted EC number

The CAZy family and associated EC number are stored within the same element in the `prediction_output` list. The CAZy family is always listed first, followed by a colon and then the EC number, which allows the retrieval of only the EC number.

The item at position `[-2]` in `prediction_output` is the 'Best function' predicted by CUPP, and the item at position `[-3]` is the raw function. The item listed as the 'Best function' is taken as the predicted CAZy family and associated EC numbers for the protein.

In [14]:
for line in log_file[:4]:
    # separate the data fields
    prediction_output = line.split("\t")
    
    # retrieve the predicted domain range
    for item in prediction_output:
        if item.find("..") != -1:
            domain_range = item
            break
        else:
            domain_range = np.nan
    
    # retrieve the predicted EC number
    try:
        ec_number = prediction_output[-2].split(":")[1]
    except IndexError:
        ec_number = np.nan
    
    # store the prediction output in a dictionary
    prediction = {
        "protein_accession": prediction_output[0],
        "cazy_family": prediction_output[1],
        "cazy_subfamily": prediction_output[-1],
        "ec_number": ec_number,
        "domain_range": domain_range,
    }

print("protein_accession =", prediction["protein_accession"])
print("CAZy family =", prediction["cazy_family"])
print("CAZy subfamily =", prediction["cazy_subfamily"])
print("EC# =", prediction["ec_number"])
print("Domain range =", prediction["domain_range"])

protein_accession = XP_001547254.1 BCIN_01g00800
CAZy family = AA1
CAZy subfamily = AA1_3
EC# = 1.10.3.2
Domain range = 202..455
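An alternative to catching `IndexError` is `str.partition`, which always returns three parts, so a missing ':' needs no exception handling. The sketch below is an illustration rather than the method used in `pyrewton`; the function name is hypothetical, `None` stands in for the null value, and fields that list more than one "family:EC" group are not handled.

```python
def split_function_field(function_field):
    """Split a 'family:EC' string into (family, ec_string).

    ec_string is None when the field is empty or contains no ':'.
    """
    family, separator, ec_string = function_field.partition(":")
    if not separator or not ec_string:
        return (family or None), None
    return family, ec_string


print(split_function_field("AA1:1.10.3.2"))
print(split_function_field(""))
```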


<a id="cupp_ec_quality"></a>

### Adding quality checks when retrieving EC numbers

The retrieval of the predicted EC number appears susceptible to errors. These appear to result from not fully analysing all the potential formats in which CUPP may write out its predicted EC numbers.

Therefore, first let's look at the EC numbers listed in the CUPP output log file.

In [15]:
for line in log_file:
    # separate the data fields
    prediction_output = line.split("\t")
    print(prediction_output[-2], repr(prediction_output[-2]))

AA4:1.1.3.38 'AA4:1.1.3.38'
GT2:2.4.1.117 'GT2:2.4.1.117'
 ''
AA1:1.10.3.2 'AA1:1.10.3.2'
 ''
 ''
GT39:2.4.1.109 'GT39:2.4.1.109'
GT2:2.4.1.16 'GT2:2.4.1.16'
AA11:1.*.*.* 'AA11:1.*.*.*'
GH71:3.2.1.59 'GH71:3.2.1.59'
GH71:3.2.1.59 'GH71:3.2.1.59'
GH38:3.2.1.24 'GH38:3.2.1.24'
GH38:3.2.1.24 'GH38:3.2.1.24'
GT2:2.4.1.16 'GT2:2.4.1.16'
GT2:2.4.1.16 'GT2:2.4.1.16'
 ''
GH47:3.2.1.113 'GH47:3.2.1.113'
 ''
GH114:3.2.1.109 'GH114:3.2.1.109'
GH135:3.2.1.* 'GH135:3.2.1.*'
GH16:2.4.1.*&3.2.1.*-GH16:3.2.1.39 'GH16:2.4.1.*&3.2.1.*-GH16:3.2.1.39'
GH47:3.2.1.113 'GH47:3.2.1.113'
CE12:3.1.1.* 'CE12:3.1.1.*'
AA1:1.10.3.2 'AA1:1.10.3.2'
 ''
GH28:3.2.1.67 'GH28:3.2.1.67'
 ''
GH28:3.2.1.15 'GH28:3.2.1.15'
GH125:3.2.1.* 'GH125:3.2.1.*'
AA1:1.10.3.2 'AA1:1.10.3.2'
GH27:3.2.1.49 'GH27:3.2.1.49'
GH135:3.2.1.* 'GH135:3.2.1.*'
GT50:2.4.1.* 'GT50:2.4.1.*'
CE16:3.1.1.6 'CE16:3.1.1.6'
 ''
GT20:2.4.1.15 'GT20:2.4.1.15'
 ''
CE5:3.1.1.74 'CE5:3.1.1.74'
 ''
CE9:3.5.1.25 'CE9:3.5.1.25'
GH13:2.4.1.25&3.2.1.33 'GH13:2.4.1

The above shows that when no EC number is predicted an empty string is provided (in the same manner as discussed when retrieving the predicted CAZy subfamily). It is better to store missing data as a null value instead. Furthermore, sometimes instead of writing out an empty string the word 'Unknown' is written out. For standardisation of the output this too will be converted to a null value.

Additionally, it is more common practice to annotate missing digits in an EC number with `-` rather than `*`. Changing `*` to `-` would also bring the EC number output from CUPP in line with the other CAZyme prediction tools.

For human readability, when multiple EC numbers are predicted, the '&' will be replaced so that the EC numbers are separated by a comma, matching the formatting used to present multiple predicted CAZy subfamilies.

Another issue with the retrieval of the predicted EC numbers is that the above method does not retrieve _all_ predicted EC numbers. If multiple EC numbers are predicted these predictions may be separated by '-'.

For example, protein XP_001561324.1 BCIN_01g06010, has the following output in the log file:
```
['XP_001561324.1 BCIN_01g06010', 'GH16', 'GH16:21.1', '14.8', '4.11', '123', '89..206', '56', 'GH16:2.4.1.*&3.2.1.*-GH16:3.2.1.39', 'GH16:2.4.1.*&3.2.1.*-GH16:3.2.1.39', '']
```
Here the raw function and best function fields (both `GH16:2.4.1.*&3.2.1.*-GH16:3.2.1.39`) contain multiple listed EC numbers, with the "family:EC" groups separated by '-'.

Therefore, additional checks for multiple predicted EC numbers, and quality checks of the EC number formatting to ensure only the EC number is retrieved, were added to the method of parsing the log file.

The example below looks at a slice of the log file (from index 462 onwards) to highlight some important examples of data.
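As a compact illustration of these rules, a single regular expression can pull every EC number out of a function field, regardless of whether '&' or '-' separates them. This is a sketch of an alternative technique, not the method developed in this notebook: `EC_PATTERN` and `extract_ec_numbers` are hypothetical names, and `None` is used as the null value here where the notebook uses `np.nan`.

```python
import re

# four dot-separated parts; the last three may be digits or '*'
EC_PATTERN = re.compile(r"\d+\.(?:\d+|\*)\.(?:\d+|\*)\.(?:\d+|\*)")


def extract_ec_numbers(function_field):
    """Return all EC numbers in a CUPP function field as a comma-separated string.

    '*' is replaced with '-' for missing digits. Returns None when no EC number
    is found (for example an empty string or 'Unknown').
    """
    ec_numbers = [ec.replace("*", "-") for ec in EC_PATTERN.findall(function_field)]
    return ", ".join(ec_numbers) if ec_numbers else None


print(extract_ec_numbers("GH16:2.4.1.*&3.2.1.*-GH16:3.2.1.39"))
print(extract_ec_numbers("Unknown"))
```

Because the pattern requires four dot-separated parts, CAZy family names such as `GH16` and strings such as 'Unknown' are never returned as EC numbers.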

In [16]:
for line in log_file[462:]:
    # separate the data fields
    prediction_output = line.split("\t")
    
    # retrieve the predicted EC number
    ec_data = prediction_output[-2].split(":")

    if len(ec_data) == 1:  # no EC number predicted
        ec_number = np.nan

    elif len(ec_data) == 2:
        # element 0 is the CAZy family
        all_ec_numbers = []
        ec_split = ec_data[1].split("&")  # if multiple EC numbers are predicted, separate them

        for item in ec_split:
            # check formatting; sometimes 'Unknown' is written instead of an EC number
            if re.match(r"\d+?\.(\d+?|\*)\.(\d+?|\*)\.(\d+?|\*)", item) is not None:
                item = item.replace("*", "-")  # standardise missing digit format
            all_ec_numbers.append(item)

        if len(all_ec_numbers) == 0:
            ec_number = np.nan
        else:
            ec_number = ", ".join(all_ec_numbers)
        print("Checking EC# parsing\nec_split=", ec_split, "\nall_ec_numbers=", all_ec_numbers, "\nec_number=", ec_number)


Checking EC# parsing
ec_split= ['2.4.1.109'] 
all_ec_numbers= ['2.4.1.109'] 
ec_number= 2.4.1.109
Checking EC# parsing
ec_split= ['2.4.1.1'] 
all_ec_numbers= ['2.4.1.1'] 
ec_number= 2.4.1.1
Checking EC# parsing
ec_split= ['2.4.1.256'] 
all_ec_numbers= ['2.4.1.256'] 
ec_number= 2.4.1.256
Checking EC# parsing
ec_split= ['2.4.1.*'] 
all_ec_numbers= ['2.4.1.-'] 
ec_number= 2.4.1.-
Checking EC# parsing
ec_split= ['3.2.1.101'] 
all_ec_numbers= ['3.2.1.101'] 
ec_number= 3.2.1.101
Checking EC# parsing
ec_split= ['2.4.1.173'] 
all_ec_numbers= ['2.4.1.173'] 
ec_number= 2.4.1.173
Checking EC# parsing
ec_split= ['3.2.1.59'] 
all_ec_numbers= ['3.2.1.59'] 
ec_number= 3.2.1.59
Checking EC# parsing
ec_split= ['2.4.1.*'] 
all_ec_numbers= ['2.4.1.-'] 
ec_number= 2.4.1.-
Checking EC# parsing
ec_split= ['3.2.1.14'] 
all_ec_numbers= ['3.2.1.14'] 
ec_number= 3.2.1.14
Checking EC# parsing
ec_split= ['3.2.1.106'] 
all_ec_numbers= ['3.2.1.106'] 
ec_number= 3.2.1.106
Checking EC# parsing
ec_split= ['3.2.1.4'] 


From this we can define a function to retrieve the predicted EC numbers for the CAZy family that CUPP has listed as the 'Best function' for the protein.

In [17]:
def get_cupp_ec_number(prediction_output):
    """Retrieve predicted EC numbers from CUPP log file.
    
    EC numbers are represented as "CAZy_fam:EC_number" in the log file.
    If multiple EC numbers are predicted and the CAZy families of each are
    given then these are separated by '-'. If multiple EC numbers are predicted
    for the same CAZy family, these are separated by '&'.
    
    :param prediction_output: list of items from log file line.
    
    Return string if EC numbers are given, or null value if not.
    """       
    # retrieve the data from the CUPP 'Best function' prediction
    ec_data = prediction_output[-2].split(":")
    
    if len(ec_data) == 1:  # Result of ":" not being present, caused by no EC number predictions
        ec_number = np.nan
        return ec_number

    # create empty list to store all predicted EC numbers
    all_ec_numbers = []

    for item in ec_data:
        # check if the item may be an EC number, which start with a digit
        try:
            re.match(r"\d.+", item).group()
        except AttributeError:  # not an EC number
            continue

        # if multiple best function and/or EC number predictions were made they may
        # be separated by a dash '-'
        dash_separated_data = item.split("-")

        for data in dash_separated_data:
            # if multiple EC numbers are predicted they are separated by '&'
            split_data = data.split("&")
            
            for string in split_data:
                # check if the string is an EC number
                try:
                    re.match(r"\d+?\.(\d+?|\*)\.(\d+?|\*)\.(\d+?|\*)", string).group()
                    all_ec_numbers.append(string)
                except AttributeError:  # not an EC number
                    continue
                    
    # standardise missing digits in the EC numbers from '*' to '-'
    # ('Unknown' never matches the EC number pattern above, so it has already been excluded)
    all_ec_numbers = [ec.replace("*", "-") for ec in all_ec_numbers]
                    
    # write out the EC numbers in easy human readable format or create null value if none were predicted
    if len(all_ec_numbers) == 0:
        ec_number = np.nan
    else:
        ec_number = ", ".join(all_ec_numbers) 

    return ec_number

for line in log_file:
    # separate the data fields
    prediction_output = line.split("\t")
    ecs = get_cupp_ec_number(prediction_output)
    print(ecs)

1.1.3.38
2.4.1.117
nan
1.10.3.2
nan
nan
2.4.1.109
2.4.1.16
1.-.-.-
3.2.1.59
3.2.1.59
3.2.1.24
3.2.1.24
2.4.1.16
2.4.1.16
nan
3.2.1.113
nan
3.2.1.109
3.2.1.-
2.4.1.-, 3.2.1.-, 3.2.1.39
3.2.1.113
3.1.1.-
1.10.3.2
nan
3.2.1.67
nan
3.2.1.15
3.2.1.-
1.10.3.2
3.2.1.49
3.2.1.-
2.4.1.-
3.1.1.6
nan
2.4.1.15
nan
3.1.1.74
nan
3.5.1.25
2.4.1.25, 3.2.1.33
2.4.1.25, 3.2.1.33
nan
3.1.1.74
nan
3.1.1.11
3.2.1.-
nan
3.2.1.101
nan
3.2.1.-
nan
nan
nan
nan
nan
3.2.1.59
3.2.1.59
nan
3.2.1.96
2.4.1.80
nan
nan
3.2.1.22, 3.2.1.49
3.2.1.4
2.4.1.109
nan
1.-.-.-
2.4.99.18
nan
1.10.3.2
1.10.3.2
3.2.1.21, 3.2.1.23
2.4.1.131
nan
3.2.1.6, 3.2.1.73
nan
nan
2.4.1.259, 2.4.1.261
2.4.1.-
3.2.1.20, 5.4.99.11
3.2.1.15
nan
nan
nan
2.4.1.34
2.4.1.-
nan
3.1.1.-
3.2.1.6, 3.2.1.73
3.2.1.6, 3.2.1.73
nan
1.10.3.2
3.2.1.99
3.2.1.101
3.2.1.45
2.4.1.15
3.2.1.-
3.2.1.23
3.2.1.59
4.2.2.10
nan
3.2.1.8
nan
1.1.3.13
nan
3.2.1.78
nan
nan
3.2.1.15
2.4.1.-
nan
3.2.1.22
nan
nan
3.2.1.8
nan
3.2.1.151
3.1.1.11
nan
3.2.1.4
2.4.1.-
nan
3.1.1.11
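
The separator rules described in the docstring can be checked on a few field values taken from this notebook. The following is a minimal sketch rather than the function used in `pyrewton`: it assumes an EC number always has four dot-separated parts, and `extract_ec_numbers` and `EC_PATTERN` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical pattern: four dot-separated parts, where the last three may be '*'
EC_PATTERN = re.compile(r"\d+\.(?:\d+|\*)\.(?:\d+|\*)\.(?:\d+|\*)")


def extract_ec_numbers(field):
    """Return all EC numbers found in a CUPP function field, writing '*' as '-'."""
    return [ec.replace("*", "-") for ec in EC_PATTERN.findall(field)]


# EC numbers for different family entries are separated by '-'
print(extract_ec_numbers("GH13:3.2.1.20-GH13:5.4.99.11"))  # ['3.2.1.20', '5.4.99.11']
# EC numbers for the same family are separated by '&'
print(extract_ec_numbers("GH13:3.2.1.10&3.2.1.20"))  # ['3.2.1.10', '3.2.1.20']
# missing digits are written as '*' by CUPP
print(extract_ec_numbers("AA11:1.*.*.*"))  # ['1.-.-.-']
# no ':' and no EC number
print(extract_ec_numbers("GH18"))  # []
```

Because the pattern searches for EC numbers directly, the '-' and '&' separators and the 'CAZy_fam:' prefixes do not need to be split out first.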


<a id="cupp_subfam"></a>

### Getting the predicted CAZy subfamily

Sometimes, when no subfamily is predicted, an empty string is written to the output log file. However, it is better if the resulting Pandas dataframe contains a null value instead.

At this stage we can also create an empty dataframe, to which the data from each parsed protein is added.

In [18]:
# build an empty dataframe
cupp_df = pd.DataFrame(columns=[
    "protein_accession",
    "cazy_family",
    "cazy_subfamily",
    "ec_number",
    "domain_range",
])

for line in log_file[:10]:
    # separate the data fields
    prediction_output = line.split("\t")
    
    # retrieve the predicted domain range if given
    for item in prediction_output:
        if item.find("..") != -1:
            domain_range = item
            break
        else:
            domain_range = np.nan
    
    # retrieve the predicted EC number
    try:
        ec_number = prediction_output[-2].split(":")[1]
    except IndexError:
        ec_number = np.nan

    # store the prediction output in a dictionary
    prediction = {
        "protein_accession": [prediction_output[0]],
        "cazy_family": [prediction_output[1]],
        "cazy_subfamily": [prediction_output[-1]],
        "ec_number": ec_number,
        "domain_range": [domain_range],
    }

    # if no CAZy subfamily was predicted, change empty string to null value
    if len(prediction["cazy_subfamily"][0]) == 0:
        prediction["cazy_subfamily"][0] = np.nan

    # create dataframe to represent the current working protein
    prediction_df = pd.DataFrame(prediction)
    # add protein to CUPP dataframe (DataFrame.append was removed in pandas 2.0)
    cupp_df = pd.concat([cupp_df, prediction_df])

cupp_df

|   | protein_accession | cazy_family | cazy_subfamily | ec_number | domain_range |
|---|---|---|---|---|---|
| 0 | XP_001545605.1 BCIN_01g00260 | AA4 | NaN | 1.1.3.38 | 122..568 |
| 0 | XP_024545922.1 BCIN_01g00340 | GT2 | NaN | 2.4.1.117 | 116..300 |
| 0 | XP_001547235.1 BCIN_01g00640 | AA3 | AA3_1 | NaN | 26..553 |
| 0 | XP_001547254.1 BCIN_01g00800 | AA1 | AA1_3 | 1.10.3.2 | 202..455 |
| 0 | XP_024546029.1 BCIN_01g01760 | GH18 | NaN | NaN | 116..258 |
| 0 | XP_024546029.1 BCIN_01g01760 | PL7 | PL7:2 | NaN | 989..1040 |
| 0 | XP_001548518.1 BCIN_01g02310 | GT39 | NaN | 2.4.1.109 | 62..264 |
| 0 | XP_001548545.1 BCIN_01g02520 | GT2 | NaN | 2.4.1.16 | 213..452 |
| 0 | XP_001549934.1 BCIN_01g02970 | AA11 | NaN | `1.*.*.*` | 55..160 |
| 0 | XP_024546135.1 BCIN_01g03210 | GH71 | NaN | 3.2.1.59 | 71..381 |


<a id="cupp_std_subfam"></a>

### Standardising CAZy subfamilies (PL7:2)

Sometimes the CAZy subfamily is not written in the standard CAZy format of the family, an underscore and the subfamily number (for example, 'AA3_1'). For protein "XP_024546029.1 BCIN_01g01760", the CAZy subfamily is instead listed as 'PL7:2'.

Looking further through the CUPP output log file, there are several instances of this occurring, often for CAZy families in which CAZy has not created a subfamily with the number output by CUPP. For example, protein XP_001549864.1 BCIN_03g07360 is predicted to have the subfamily 'PL1:4'; however, there is no CAZy subfamily 'PL1:4'.

It was hypothesised that CUPP lists a subfamily with a colon, instead of the traditional CAZy underscore, to indicate that the predicted CAZy subfamily is one created by CUPP and is not in CAZy.

Looking at lines 2890 and 2914 in the `CUPP.CUPPprediction.py` script, the `:` in the predicted subfamily is replaced by `_` when writing the output FASTA file. Therefore, the predicted CAZy subfamily is written as 'PL7_2' in the output FASTA file. However, in line 2909, when writing out to the log file, the replacement of `:` with `_` is not performed. Hence, the predicted CAZy subfamily is written as 'PL7:2' in the output log file.

For standardisation, the colon in the predicted CAZy subfamily was changed to an underscore.

Moreover, the log file shows that multiple subfamilies can be predicted for a given protein. These subfamilies are separated by '+'. To improve human readability, the separator was changed to a comma followed by a space (', '). For example, CUPP's analysis of protein sequences from the genome GCA143535.1 predicted two CAZy subfamilies within the GH13 CAZyme domain of the protein XP_001552718.2.  
`XP_001552718.2 BCIN_02g05690	GH13	GH13:0.2	15.7	5.93	151	70..391	76	GH13:3.2.1.10&3.2.1.20	GH13:3.2.1.20-GH13:5.4.99.11	GH13:31+GH13:40`

In [19]:
for line in log_file[:10]:
    # separate the data fields
    prediction_output = line.split("\t")

    # retrieve predicted CAZy subfamily
    subfam = prediction_output[-1]
    # if no CAZy subfamily was predicted, change empty string to null value
    if len(subfam) == 0:
        subfam = np.nan
    else:
        # write subfamilies using CAZy standard format
        subfam = subfam.replace(":", "_")
        subfam = subfam.replace("+", ", ")  # separate multiple subfams with ', ' not '+'

    print("subfam=", subfam)

subfam= nan
subfam= nan
subfam= AA3_1
subfam= AA1_3
subfam= nan
subfam= PL7_2
subfam= nan
subfam= nan
subfam= nan
subfam= nan
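
None of the first ten lines contains more than one subfamily, so the '+' replacement is not exercised above. As a small sketch, the same two replacements can be applied to the XP_001552718.2 line quoted earlier. The field values are copied from that line; `standardise_subfamily` is a hypothetical helper name, and `math.nan` is used in place of `np.nan` only to keep the sketch free of third-party imports.

```python
import math

# Fields of the XP_001552718.2 log line quoted above, joined with tabs as in the log file
EXAMPLE_LINE = "\t".join([
    "XP_001552718.2 BCIN_02g05690", "GH13", "GH13:0.2", "15.7", "5.93", "151",
    "70..391", "76", "GH13:3.2.1.10&3.2.1.20", "GH13:3.2.1.20-GH13:5.4.99.11",
    "GH13:31+GH13:40",
])


def standardise_subfamily(raw_subfam):
    """Return the subfamily field in CAZy format, or a null value if it is empty."""
    if len(raw_subfam) == 0:
        return math.nan
    # ':' becomes '_' and multiple subfamilies are separated by ', ' instead of '+'
    return raw_subfam.replace(":", "_").replace("+", ", ")


prediction_output = EXAMPLE_LINE.split("\t")
print(standardise_subfamily(prediction_output[-1]))  # GH13_31, GH13_40
```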


<a id="cupp_functions"></a>

### Combining functions, populating the data model, and adding non-CAZymes

The above can be combined into functions, which are then included in `pyrewton.cazymes.prediction.parse`. To evaluate CUPP's performance at differentiating between CAZymes and non-CAZymes, a binary classification of CAZyme (1) or non-CAZyme (0) is needed for each protein. If a protein is included in the CUPP output, CUPP has predicted that the protein is a CAZyme, so an additional column with a value of 1 can be added to the standardised CUPP output to represent CUPP's CAZyme/non-CAZyme classification of the protein.

Evaluating CUPP's performance also requires information on the non-CAZymes. CUPP only includes proteins that it has classified as CAZymes in its log file; therefore, the remaining protein sequences are retrieved from the FASTA file.
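
The classification logic can be sketched independently of the data model. The example below is an illustration only: `classify_proteins` is a hypothetical helper, and the FASTA lines and accessions are made up for the example rather than taken from any CUPP run.

```python
def classify_proteins(fasta_lines, cazyme_accessions):
    """Map every accession in the FASTA to 1 (CAZyme) or 0 (non-CAZyme)."""
    classifications = {}
    for line in fasta_lines:
        if line.startswith(">"):
            # the accession is the header line without the leading '>'
            accession = line[1:].strip()
            # proteins absent from the CUPP log file are inferred to be non-CAZymes
            classifications[accession] = 1 if accession in cazyme_accessions else 0
    return classifications


# made-up FASTA content and made-up set of accessions found in a CUPP log file
fasta_lines = [">prot_A", "MKTAYIAK", ">prot_B", "MSLLTEVE", ">prot_C", "MAHHHHHH"]
in_cupp_log = {"prot_B"}

print(classify_proteins(fasta_lines, in_cupp_log))
# {'prot_A': 0, 'prot_B': 1, 'prot_C': 0}
```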

In [20]:
def parse_cupp_output(log_file_path, fasta_path):
    """Parse the output log file from CUPP into CazymeProteinPrediction instances.
    
    Retrieves the protein accession, predicted CAZy families, CAZy subfamilies, EC
    numbers and domain ranges.
    
    :param log_file_path: Path, path to the output log file
    :param fasta_path: Path, path to the FASTA containing the query sequences for CUPP
    
    Return a list of CazymeProteinPrediction instances, where each instance represents one protein.
    """
    # open the CUPP output log file
    with open(log_file_path, "r") as lfh:
        log_file = lfh.read().splitlines()
    
    cupp_predictions = {}  # stores proteins {protein_accession:CUPPprediction_instance}

    for line in tqdm(log_file, desc="parsing cupp output"):
        # separate the data fields
        prediction_output = line.split("\t")

        # retrieve the predicted domain range if given
        domain_range = get_cupp_domain_range(prediction_output)  # list of strs
        
        # retrieve predicted EC number if given
        ec_numbers = get_cupp_ec_number(prediction_output)  # list of strs
        
        # retrieve predicted CAZy subfamily if given
        cazy_subfamily = prediction_output[-1] 
        # if no CAZy subfamily was predicted, change empty string to null value
        if len(cazy_subfamily) == 0:
            cazy_subfamily = np.nan
        else:
            # write subfamilies using CAZy standard format
            cazy_subfamily = cazy_subfamily.replace(":","_")
            cazy_subfamily = cazy_subfamily.replace("+", ", ")  # separate multiple subfams with ', ' not '+'

        # retrieve the protein accession and check if it already has a corresponding CUPPprediction instance
        protein_accession = prediction_output[0]

        # retrieve the CAZyme classification and predicted CAZy family
        cazyme_classification = 1  # all proteins included in CUPP log file are identified as CAZymes
        cazy_family = prediction_output[1]  # This is selected from the CUPP 'Best function'
        
        # check if a CUPPprediction instance already exists for the protein
        try:
            existing_prediction = cupp_predictions[protein_accession]
            print("existing=", existing_prediction)
            
            # check if the CAZyme domain has been parsed before
            print("existing domains=", len(existing_prediction.cazyme_domains), existing_prediction.cazyme_domains)
            existing_cazyme_domains = existing_prediction.cazyme_domains
            
            existence = False
            for domain in existing_cazyme_domains:
                if (domain.cazy_family == cazy_family) and (domain.cazy_subfamily == cazy_subfamily):
                    for ec in ec_numbers:
                        domain.ec_numbers.append(ec)
                    for drange in domain_range:
                        domain.domain_range.append(drange)
                    existence = True
            
            if existence is False:
                # create new CAZyme domain
                new_cazyme_domain = CazymeDomain(
                    protein_accession,
                    cazy_family,
                    cazy_subfamily,
                    ec_numbers,
                    domain_range,
                )
                existing_prediction.cazyme_domains.append(new_cazyme_domain)
        
        except KeyError:  # raised if there is no instance for the protein

            new_cazyme_domain = CazymeDomain(
                protein_accession,
                cazy_family,
                cazy_subfamily,
                ec_numbers,
                domain_range,
            )
            
            new_protein = CazymeProteinPrediction(
                "CUPP",
                protein_accession,
                cazyme_classification,
                [new_cazyme_domain],
            )
            
            cupp_predictions[protein_accession] = new_protein
    
    print("len1=", len(list(cupp_predictions.values())))

    # add non-CAZymes
    cupp_predictions = add_non_cazymes(fasta_path, cupp_predictions)
    
    cupp_predictions = list(cupp_predictions.values())

    return cupp_predictions


def get_cupp_domain_range(prediction_output):
    """Retrieve the predicted amino acid range of predicted CAZyme domain from CUPP log file.
    
    :param prediction_output: list of items from log file line.
    
    Return a list of predicted domain ranges, or a list containing a null value if none were found.
    """
    domain_range = []  # store as a list in case multiple domain ranges are given

    for item in prediction_output:
        if item.find("..") != -1:
            domain_range.append(item)

    # check retrieved items are definitely the domain ranges
    # build a new list, because removing items from a list while iterating over it skips items
    checked_ranges = []
    for item in domain_range:
        if re.match(r"\d+\.\.\d+", item) is None:
            # write as logger in pyrewton
            print(f"RAISED ATTRIBUTE ERROR: {item} misidentified as domain range")
        else:
            checked_ranges.append(item)
    domain_range = checked_ranges

    if len(domain_range) == 0:
        domain_range = [np.nan]
    
    return domain_range


def get_cupp_ec_number(prediction_output):
    """Retrieve predicted EC numbers from CUPP log file.
    
    EC numbers are represented as "CAZy_fam:EC_number" in the log file.
    If multiple EC numbers are predicted and the CAZy families of each are
    given then these are separated by '-'. If multiple EC numbers are predicted
    for the same CAZy family, these are separated by '&'.
    
    :param prediction_output: list of items from log file line.
    
    Return a list of predicted EC numbers, or a list containing a null value if none were predicted.
    """       
    # retrieve the data from the CUPP 'Best function' prediction
    ec_data = prediction_output[-2].split(":")
    
    if len(ec_data) == 1:  # Result of ":" not being present, caused by no EC number predictions
        # return the null value in a list, so the caller can always iterate over the result
        return [np.nan]

    # create empty list to store all predicted EC numbers
    all_ec_numbers = []

    for item in ec_data:
        # check if the item may be an EC number, which start with a digit
        try:
            re.match(r"\d.+", item).group()
        except AttributeError:  # not an EC number
            continue

        # if multiple best function and/or EC number predictions were made they may
        # be separated by a dash '-'
        dash_separated_data = item.split("-")

        for data in dash_separated_data:
            # if multiple EC numbers are predicted they are separated by '&'
            split_data = data.split("&")
            
            for string in split_data:
                # check if the string is an EC number
                try:
                    re.match(r"\d+?\.(\d+?|\*)\.(\d+?|\*)\.(\d+?|\*)", string).group()
                    all_ec_numbers.append(string)
                except AttributeError:  # not an EC number
                    continue
                    
    # standardise missing digits in the EC numbers from '*' to '-', and drop any
    # 'Unknown' entries that CUPP sometimes writes out; a new list is built because
    # removing items from a list while looping over its indices can raise an IndexError
    all_ec_numbers = [
        ec.replace("*", "-") for ec in all_ec_numbers if "Unknown" not in ec
    ]
                    
    # create a null value in a list if no EC numbers were predicted
    if len(all_ec_numbers) == 0:
        all_ec_numbers = [np.nan]

    return all_ec_numbers


def add_non_cazymes(fasta_path, cupp_predictions):
    """Add proteins that CUPP identified as non-CAZymes to the collection of CazymeProteinPrediction instances.
    
    :param fasta_path: Path, FASTA file used as input for CUPP.
    :param cupp_predictions: dict, key=protein_accession, value=CazymeProteinPrediction instance
    
    Return a dictionary keyed by protein accession and valued by the respective CazymeProteinPrediction instance.
    """
    print("len=", len(list(cupp_predictions.values())))
    # open the FASTA path
    with open(fasta_path) as fh:
        fasta = fh.read().splitlines()

    for line in tqdm(fasta, desc="Adding non-CAZymes"):
        if line.startswith(">"):
            protein_accession = line[1:].strip()
            
            # check if the protein has been listed as a CAZyme by CUPP
            try:
                cupp_predictions[protein_accession]
            except KeyError:
                # raised if protein not in cupp_predictions, inferring protein was not labelled as CAZyme
                cazyme_classification = 0
                cupp_predictions[protein_accession] = CazymeProteinPrediction(
                    "CUPP",
                    protein_accession,
                    cazyme_classification,
                )
        
    return cupp_predictions
    

fasta_path = "genbank_proteins_txid73501_GCA_008080495_1.fasta"
result = parse_cupp_output("gca_008080495_cupp.fasta.log", fasta_path)

print("type=", type(result))
print("len=", len(result))
print("index 0=", result[0])

for i in result:
    print(i)


existing= -CazymeProteinPrediction, protein=ATY67037.1 A9K55_000868, CAZyme domains=1-
existing domains= 1 [<CazymeDomain parent=GT41 fam=nan, subfam=nan>]
existing= -CazymeProteinPrediction, protein=ATY61276.1 A9K55_005467, CAZyme domains=1-
existing domains= 1 [<CazymeDomain parent=GH13 fam=GH13_25, subfam=['2.4.1.25', '3.2.1.33']>]
existing= -CazymeProteinPrediction, protein=ATY62702.1 A9K55_007186, CAZyme domains=1-
existing domains= 1 [<CazymeDomain parent=CE3 fam=nan, subfam=nan>]

len1= 279
len= 279



type= <class 'list'>
len= 9287
index 0= -CazymeProteinPrediction, protein=ATY67280.1 A9K55_000020, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67280.1 A9K55_000020, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67467.1 A9K55_000078, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67289.1 A9K55_000120, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67290.1 A9K55_000123, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67196.1 A9K55_000147, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67226.1 A9K55_000162, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67472.1 A9K55_000170, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67327.1 A9K55_000335, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67309.1 A9K55_000364, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67094.1 A9K55_000365, CAZyme domains=1-
-CazymeProteinPrediction, protein=ATY67306.1 A9K55_000368, CAZyme domains=1-
-CazymeProteinPrediction, protein=A

-CazymeProteinPrediction, protein=ATY67401.1 A9K55_000209, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67124.1 A9K55_000210, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67123.1 A9K55_000211, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67122.1 A9K55_000212, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67121.1 A9K55_000213, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67128.1 A9K55_000214, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67127.1 A9K55_000215, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67126.1 A9K55_000216, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67125.1 A9K55_000217, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67130.1 A9K55_000218, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67129.1 A9K55_000219, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67311.1 A9K55_000220, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67207.1 A9K55_000221, non-CAZyme-
-CazymeProteinPrediction, protein=ATY67188.1 A9K55_000222, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY66110.1 A9K55_001815, non-CAZyme-
-CazymeProteinPrediction, protein=ATY66106.1 A9K55_001816, non-CAZyme-
-CazymeProteinPrediction, protein=ATY66107.1 A9K55_001817, non-CAZyme-
-CazymeProteinPrediction, protein=ATY66165.1 A9K55_001818, non-CAZyme-
-CazymeProteinPrediction, protein=ATY66166.1 A9K55_001819, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65607.1 A9K55_001820, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65904.1 A9K55_001821, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65982.1 A9K55_001822, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65892.1 A9K55_001823, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65883.1 A9K55_001824, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65875.1 A9K55_001825, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65979.1 A9K55_001826, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65978.1 A9K55_001827, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65610.1 A9K55_001828, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY58436.1 A9K55_002527, non-CAZyme-
-CazymeProteinPrediction, protein=ATY58442.1 A9K55_002528, non-CAZyme-
-CazymeProteinPrediction, protein=ATY58441.1 A9K55_002529, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59219.1 A9K55_002530, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59220.1 A9K55_002531, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59221.1 A9K55_002532, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59222.1 A9K55_002533, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59223.1 A9K55_002534, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59224.1 A9K55_002535, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59225.1 A9K55_002536, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59226.1 A9K55_002537, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59227.1 A9K55_002538, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59228.1 A9K55_002539, non-CAZyme-
-CazymeProteinPrediction, protein=ATY58401.1 A9K55_002540, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY64513.1 A9K55_003939, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63844.1 A9K55_003940, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63845.1 A9K55_003941, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63842.1 A9K55_003942, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63843.1 A9K55_003943, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63848.1 A9K55_003944, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63849.1 A9K55_003945, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63846.1 A9K55_003946, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63847.1 A9K55_003947, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63841.1 A9K55_003948, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64467.1 A9K55_003950, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64466.1 A9K55_003951, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64469.1 A9K55_003952, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64468.1 A9K55_003953, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY65208.1 A9K55_004744, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65084.1 A9K55_004745, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65210.1 A9K55_004746, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64413.1 A9K55_004747, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65215.1 A9K55_004748, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65094.1 A9K55_004749, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64105.1 A9K55_004750, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64106.1 A9K55_004751, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64103.1 A9K55_004752, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64104.1 A9K55_004753, non-CAZyme-
-CazymeProteinPrediction, protein=ATY65056.1 A9K55_004755, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64108.1 A9K55_004757, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64097.1 A9K55_004758, non-CAZyme-
-CazymeProteinPrediction, protein=ATY64098.1 A9K55_004759, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY59906.1 A9K55_005744, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59907.1 A9K55_005745, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59912.1 A9K55_005746, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59913.1 A9K55_005747, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59910.1 A9K55_005748, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59911.1 A9K55_005749, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59914.1 A9K55_005750, non-CAZyme-
-CazymeProteinPrediction, protein=ATY59915.1 A9K55_005751, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60736.1 A9K55_005752, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60735.1 A9K55_005753, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60738.1 A9K55_005754, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60737.1 A9K55_005755, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60732.1 A9K55_005756, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60731.1 A9K55_005757, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY60412.1 A9K55_005939, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61171.1 A9K55_005940, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61170.1 A9K55_005941, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60158.1 A9K55_005942, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60159.1 A9K55_005943, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60160.1 A9K55_005944, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60161.1 A9K55_005945, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60154.1 A9K55_005946, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60155.1 A9K55_005947, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60780.1 A9K55_005948, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60157.1 A9K55_005949, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60166.1 A9K55_005950, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60782.1 A9K55_005951, non-CAZyme-
-CazymeProteinPrediction, protein=ATY60985.1 A9K55_005952, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY63260.1 A9K55_007329, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62693.1 A9K55_007330, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63285.1 A9K55_007331, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61505.1 A9K55_007332, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61504.1 A9K55_007333, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61507.1 A9K55_007334, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61654.1 A9K55_007335, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61509.1 A9K55_007336, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62043.1 A9K55_007337, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62617.1 A9K55_007338, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62616.1 A9K55_007339, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61513.1 A9K55_007340, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61512.1 A9K55_007341, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62456.1 A9K55_007342, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY62553.1 A9K55_007575, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62552.1 A9K55_007576, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62550.1 A9K55_007578, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62549.1 A9K55_007579, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62548.1 A9K55_007580, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62547.1 A9K55_007581, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61862.1 A9K55_007582, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61860.1 A9K55_007584, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61861.1 A9K55_007585, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61858.1 A9K55_007586, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61859.1 A9K55_007587, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61856.1 A9K55_007588, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61857.1 A9K55_007589, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61864.1 A9K55_007590, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY63425.1 A9K55_009089, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63432.1 A9K55_009090, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63431.1 A9K55_009091, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62199.1 A9K55_009092, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62200.1 A9K55_009093, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62197.1 A9K55_009094, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62198.1 A9K55_009095, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62203.1 A9K55_009096, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62204.1 A9K55_009097, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62201.1 A9K55_009098, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62202.1 A9K55_009099, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62195.1 A9K55_009100, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62196.1 A9K55_009101, non-CAZyme-
-CazymeProteinPrediction, protein=ATY63191.1 A9K55_009102, non-CAZyme-
-Cazym

-CazymeProteinPrediction, protein=ATY62985.1 A9K55_009354, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62984.1 A9K55_009355, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62979.1 A9K55_009356, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62978.1 A9K55_009357, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62980.1 A9K55_009359, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62977.1 A9K55_009360, non-CAZyme-
-CazymeProteinPrediction, protein=ATY62976.1 A9K55_009361, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61649.1 A9K55_009362, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61650.1 A9K55_009363, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61651.1 A9K55_009364, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61652.1 A9K55_009365, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61645.1 A9K55_009366, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61646.1 A9K55_009367, non-CAZyme-
-CazymeProteinPrediction, protein=ATY61647.1 A9K55_009368, non-CAZyme-
-Cazym

From the printed class representation strings (for example, for the protein ATY62702.1), we can see that when multiple CAZyme domains are predicted for a protein, they are all collected under a single `CazymeProteinPrediction` instance.
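
The grouping behaviour can be illustrated with plain dictionaries. This is a simplified stand-in for the data model, not the `CazymeProteinPrediction` and `CazymeDomain` classes themselves: `group_families_by_protein` is a hypothetical helper, and the example lines are abbreviated to the first two fields (accession and family) of log lines shown earlier in this notebook.

```python
def group_families_by_protein(log_lines):
    """Collect the predicted CAZy family of every log line under its protein accession."""
    proteins = {}
    for line in log_lines:
        fields = line.split("\t")
        accession, family = fields[0], fields[1]
        # create the list on first sight of the protein, then append each domain's family
        proteins.setdefault(accession, []).append(family)
    return proteins


# abbreviated log lines: two domains for the first protein, one for the second
log_lines = [
    "XP_024546029.1 BCIN_01g01760\tGH18",
    "XP_024546029.1 BCIN_01g01760\tPL7",
    "XP_001548518.1 BCIN_01g02310\tGT39",
]

grouped = group_families_by_protein(log_lines)
for accession, families in grouped.items():
    print(accession, "domains =", len(families), families)
```

Keying on the protein accession mirrors the `cupp_predictions` dictionary used in `parse_cupp_output`, where a second log line for the same protein extends the existing entry instead of creating a new one.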