# CAZyme prediction tool data models

This notebook covers the design of the data models for holding data retrieved from the CAZyme prediction tools' outputs.

The data retrieved from a CAZyme prediction tool's output file is stored as `CazymeProteinPrediction` class instances. These instances are stored in a dictionary keyed by protein accession, with the respective `CazymeProteinPrediction` instance as the value. This makes it possible to check whether data for a given protein has already been retrieved from the output file, which is necessary because CUPP and eCAMI can spread the data for a given protein over multiple lines. The check facilitates combining all data for a given protein into a single `CazymeProteinPrediction` instance that represents the query protein. Here, 'query protein' is a protein parsed by the CAZyme prediction tools to predict whether it is a CAZyme and which CAZy (sub)families CAZy would catalogue it under.
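
The accession-keyed lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's parser: `parsed_rows` is made-up data standing in for lines of a CUPP or eCAMI output file, and a plain dictionary stands in for a `CazymeProteinPrediction` instance.

```python
# Hypothetical parsed output lines: one (protein accession, CAZy family) pair per line.
# The same protein can appear on more than one line.
parsed_rows = [
    ("protein_A", "GH15"),
    ("protein_B", "GT2"),
    ("protein_A", "CBM32"),
]

predictions = {}  # keyed by protein accession

for accession, family in parsed_rows:
    if accession not in predictions:
        # first time this protein is seen: create its record
        predictions[accession] = {"protein_accession": accession, "families": []}
    # later lines for the same protein are merged into the existing record
    predictions[accession]["families"].append(family)

print(predictions["protein_A"]["families"])  # ['GH15', 'CBM32']
```

Because the membership test is a dictionary lookup, checking whether a protein has been seen before stays cheap however many lines the output file contains.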

All the CAZyme prediction tools can predict multiple CAZyme domains within a single CAZyme. Each CAZyme domain is identified as a unique CAZy family with which the prediction tool has annotated the query protein. For example, if CUPP annotates query protein A with the CAZy families GH15 and CBM32, this is interpreted as CUPP predicting that protein A contains two CAZyme domains, one belonging to GH15 and the other to CBM32. This decision was made because Hotpep, CUPP and eCAMI predict specific EC numbers for each CAZy family annotation (or CAZyme domain) that they predict for a protein. Additionally, HMMER, Hotpep and CUPP predict a different domain range for each of the CAZy family annotations they predict for a protein. For example, CUPP may predict a domain range of 5-225 for GH15 and 275-389 for CBM32.

Each predicted CAZyme domain will be represented by a `CazymeDomain` class instance. Every `CazymeDomain` instance will have a:
- `protein_accession` (str), the accession of the protein in which the domain is contained
- `cazy_family` (str), the name of the CAZy family
- `cazy_subfamily` (str), the name of the CAZy subfamily if one is given, otherwise a null value

Each unique `CazymeDomain` instance will be identifiable by its unique `cazy_family`-`cazy_subfamily` combination.
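
Looking up a domain by its family-subfamily combination could look like the sketch below. `DomainSketch` and `find_domain` are hypothetical names introduced only for this illustration, and the sketch assumes `None` as the null subfamily value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainSketch:
    """Hypothetical stand-in for CazymeDomain, holding only the identifying fields."""
    cazy_family: str
    cazy_subfamily: Optional[str] = None  # None is this sketch's null value


def find_domain(domains, cazy_family, cazy_subfamily=None):
    """Return the domain with the given family-subfamily pair, or None if absent."""
    for domain in domains:
        if (domain.cazy_family, domain.cazy_subfamily) == (cazy_family, cazy_subfamily):
            return domain
    return None


domains = [DomainSketch("GH5", "GH5_1"), DomainSketch("CBM32")]

# CBM32 has no subfamily, so it is found with the default null subfamily.
print(find_domain(domains, "CBM32"))
# GH5 without a subfamily is a different combination from GH5/GH5_1, so nothing is found.
print(find_domain(domains, "GH5"))
```

Note that NaN never compares equal to itself with `==`, so if the null subfamily is stored as NaN rather than `None`, the lookup would need an explicit check such as `math.isnan` instead of a plain equality test.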

In addition, the `CazymeDomain` class has the attributes:
- `ec_numbers`, a list of strings, with each string containing a unique EC number
- `domain_range`, a list of strings, with each string containing a unique domain range

Each individual domain range is stored as a string.

Both `ec_numbers` and `domain_range` are stored as lists because multiple EC numbers and domain ranges can be predicted for a single domain. Arguably, a CAZy family listed twice for a query protein, each time with a different domain range, could indicate two different domains belonging to the same family. However, the accuracy of domain range prediction has not been evaluated, so it could equally be a single domain of which the prediction tool has identified only two regions, and these two regions could still make up a single CAZyme domain.

If a CAZyme prediction tool does not predict EC numbers or a domain range, the corresponding attribute is set to an empty list. This allows EC numbers or domain ranges to be added to the list if the CAZyme domain is listed again with a predicted EC number or domain range.
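
The list-based storage can be illustrated with a short sketch of how values from a later output line might be folded into an existing domain. `merge_unique` is a hypothetical helper, not part of the notebook, and the EC number and domain ranges are illustrative values only.

```python
def merge_unique(existing, new_items):
    """Append each new item to `existing` unless it is already present (order kept)."""
    for item in new_items:
        if item not in existing:
            existing.append(item)
    return existing


# The domain is first listed without EC numbers, so that list starts empty.
ec_numbers = []
domain_range = ["5-225"]

# The same domain is listed again on a later line of the output file.
merge_unique(ec_numbers, ["3.2.1.3"])
merge_unique(domain_range, ["5-225", "230-260"])

print(ec_numbers)    # ['3.2.1.3']
print(domain_range)  # ['5-225', '230-260']
```

Skipping values that are already present keeps each EC number and domain range unique within its list, which matches the description of these attributes above.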

In [1]:
import numpy as np  # np.nan is used as the null value for a missing CAZy subfamily


class CazymeDomain:
    """Single CAZyme domain in a protein, predicted by a CAZyme prediction tool.
    
    Each unique predicted CAZy family-subfamily combination by a CAZyme prediction
    tool for a query protein sequence is viewed as a CAZyme domain.
    
    Every CAZyme domain has a source CAZyme prediction tool that predicted the CAZyme
    domain, a parent CAZyme protein (represented by the protein accession), and CAZy
    family and subfamily combination. If no CAZy subfamily is predicted, the 
    subfamily will be listed as a null value.

    Hotpep, CUPP and eCAMI predict EC numbers for each CAZyme domain. HMMER and CUPP
    predict amino acid domain ranges. Multiple EC numbers and domain ranges can be
    predicted for a single CAZyme domain; therefore, these attributes are stored as
    lists.
    
    If an item of data is not or was not predicted by a CAZyme prediction tool, the
    respective data will be stored as a null value.
    """
    
    def __init__(
        self,
        prediction_tool,
        protein_accession,
        cazy_family,
        cazy_subfamily=None,
        ec_numbers=None,
        domain_range=None,
    ):
        """Initiate instance
        
        :attr prediction_tool: str, CAZyme prediction tool which predicted the domain
        :attr protein_accession: str
        :attr cazy_family: str
        :attr cazy_subfamily: str
        :attr ec_numbers: list (list of str, each str contains a unique EC number)
        :attr domain_range: list (list of str, each str contains a unique domain range)
        """
        self.prediction_tool = prediction_tool
        self.protein_accession = protein_accession
        self.cazy_family = cazy_family
        
        # not all CAZyme domains are catalogued under a CAZy subfamily
        if cazy_subfamily is None:
            self.cazy_subfamily = np.nan
        else:
            self.cazy_subfamily = cazy_subfamily
        
        # EC numbers are not predicted for all CAZyme domains
        if ec_numbers is None:
            self.ec_numbers = []  # enables adding in EC numbers included in another line of the output file
        else:
            self.ec_numbers = ec_numbers
        
        # Not all prediction tools predict domain ranges
        if domain_range is None:
            self.domain_range = []  # enables adding domain range listed in another line of the output file
        else:
            self.domain_range = domain_range
    
    def __str__(self):
        return f"-CazymeDomain in {self.protein_accession}, fam={self.cazy_family}, subfam={self.cazy_subfamily}-"
    
    def __repr__(self):
        return f"<CazymeDomain parent={self.protein_accession} fam={self.cazy_family}, subfam={self.cazy_subfamily}>"

    
class CazymeProteinPrediction:
    """Single protein and CAZyme/non-CAZyme prediction by a CAZyme prediction tool"""
    
    def __init__(self, prediction_tool, protein_accession, cazyme_classification, cazyme_domains=None):
        """Initiate class instance.
        
        :attr prediction_tool: str, name of CAZyme prediction tool
        :attr protein_accession: str
        :attr cazyme_classification: int, 1=CAZyme, 0=non-CAZyme
        :attr cazyme_domains: list of CazymeDomain instances, domains predicted to be in the CAZyme
        """
        self.prediction_tool = prediction_tool
        self.protein_accession = protein_accession
        self.cazyme_classification = cazyme_classification  # CAZyme=1, non-CAZyme=0
        
        # non-CAZymes will have no cazyme_domains
        if cazyme_domains is None:
            self.cazyme_domains = None
        else:
            self.cazyme_domains = cazyme_domains
    
    def __str__(self):
        if self.cazyme_classification == 0:
            return f"-CazymeProteinPrediction, protein={self.protein_accession}, non-CAZyme-"
        else:
            return f"-CazymeProteinPrediction, protein={self.protein_accession}, CAZyme domains={len(self.cazyme_domains)}-"
    
    def __repr__(self):
        return f"<CazymeProteinPrediction, protein={self.protein_accession}>"

    def get_cazyme_classification_dicts(self):
        """Return a dictionary that contains the protein accession and its CAZyme/non-CAZyme classification.
        
        This dictionary is used to create the cazyme_classification dataframe for evaluating the performance
        of the CAZyme prediction tools in terms of differentiating between CAZymes (identified with a score of 1)
        and non-CAZymes (identified with a score of 0). The CAZy classes under which the protein is classified are
        retrieved as well to enable evaluating the CAZyme/non-CAZyme differentiation per CAZy class, and observing
        differences in this performance between classes. A score of 0 indicates a protein was not classified
        under the class, a score of 1 indicates the protein was classified under the class.
        """
        cazyme_classification_dict = {
            "protein_accession": self.protein_accession,
            "cazyme_classification": self.cazyme_classification,
            "GH": 0,
            "GT": 0,
            "CE": 0,
            "PL": 0,
            "AA": 0,
            "CBM": 0,
        }
        
        # check which CAZy classes the predicted CAZyme domains were classified under
        # (non-CAZymes have cazyme_domains set to None, so fall back to an empty list)
        for domain in self.cazyme_domains or []:
            if domain.cazy_family[:3] == "CBM":
                cazyme_classification_dict["CBM"] = 1
            else:
                cazyme_classification_dict[domain.cazy_family[:2]] = 1

        return cazyme_classification_dict

    def get_cazy_family_dict(self):
        """Return a dictionary of the predicted CAZy families for a protein.
        
        This dictionary is used to create a dataframe of CAZy family predictions for evaluating the performance
        of the prediction tools to predict the CAZy family of a CAZyme. The dictionary is keyed by the CAZy
        classes, and valued by a list of the corresponding families predicted for the CAZyme. This is to enable
        evaluating the CAZy family prediction performance per class."""
        cazy_family_dict = {
            "GH": [],
            "GT": [],
            "CE": [],
            "PL": [],
            "AA": [],
            "CBM": [],
        }
        
        # non-CAZymes have cazyme_domains set to None, so fall back to an empty list
        for domain in self.cazyme_domains or []:
            if domain.cazy_family[:3] == "CBM":
                cazy_family_dict["CBM"].append(domain.cazy_family)
            else:
                cazy_family_dict[domain.cazy_family[:2]].append(domain.cazy_family)
        
        return cazy_family_dict
