# Parsing eCAMI output

The CAZyme prediction tools dbCAN, CUPP and eCAMI write their output files in different formats, and do not present the data in a standardised manner. Additionally, the raw output from the prediction tools cannot be parsed easily by the [scikit-learn](https://scikit-learn.org/stable/index.html) library, which is used for the statistical analysis of the tools' performance.

This notebook explains the theory and method used by the `pyrewton.cazymes.prediction.parse` submodule to parse the output from the CAZyme prediction tool **eCAMI**.

[eCAMI](https://academic.oup.com/bioinformatics/article/36/7/2068/5651014?login=true) is a k-mer based CAZyme prediction tool. The information defining the data included in each field of the output files came from the [README](https://github.com/zhanglabNKU/eCAMI) and the Python script `prediction.py`.

The following data is collected from the eCAMI output:
- protein_accession
- cazy_family (all predicted CAZy families)
- cazy_subfamily (all predicted CAZy subfamilies)
- ec_number (all predicted EC numbers)

### Notebook structure

The section titled **'exploring the output'** explores the tool's output files. It demonstrates how the data is constructed and presented, and forms a foundation for explaining the method developed to parse the data.

The section **'parsing the output'** shows the method of parsing the data from the CAZyme prediction tool.

The final section shows any additional parsing that was performed to make the statistical analysis of the tool's performance easier. This typically includes removing duplicates, and combining rows that represent data for the same predicted CAZyme.


## Contents

- [Imports](#notebook_imports)
- [Exploring the eCAMI output text file](#text_file)
- [Defining the data model](#data_model)
- [Getting the predicted CAZy family for the CAZyme domain](#cazy_fam)
- [Getting the predicted CAZy subfamily for the CAZyme domain](#cazy_subfam)
- [Getting the predicted EC numbers](#ecnum)
- [Combining data for a single protein](#combining_proteins)
- [Adding non-CAZymes](#no_caz)
- [Final functions](#final)

<a id="notebook_imports"></a>

## Imports

In [1]:
# Install necessary libraries
!pip3 install pandas
!pip3 install numpy

import re

import pandas as pd
import numpy as np

from tqdm.notebook import tqdm



<a id="text_file"></a>

## Exploring the eCAMI output text file

eCAMI produces a single output file, which is a plain text file in a FASTA-like style. The definitions and explanations of the data in the output file are taken from the multiple READMEs in the eCAMI [GitHub repository](https://github.com/zhanglabNKU/eCAMI) -- there is a `readme.md` file in each of the `examples/prediction` and `examples/clustering` directories.

Each predicted CAZyme can be annotated with multiple CAZy families. Each unique CAZy family/CAZy subfamily combination is interpreted as a unique CAZyme domain prediction. Each CAZyme domain prediction is listed as a separate item in the output text file from eCAMI.

Each CAZyme domain prediction is stored across two lines, with the first line starting with '>'.

For example:

```
>ATY67280.1 A9K55_000020 	GH17:43	GH17:24|3.2.1.-:1
LHKVFPGM(369),HKVFPGMD(370),KVFPGMDY(371),VFPGMDYT(372),FPGMDYTP(373),PPSQNNVT(392),PSQNNVTR(393),SQNNVTRD(394),QNNVTRDV(395),NNVTRDVA(396),NVTRDVAV(397),VTRDVAVL(398),TRDVAVLS(399),RDVAVLSQ(400),DVAVLSQL(401),AVLSQLTN(403),IRLYGTDC(412),RLYGTDCN(413),LYGTDCNQ(414),YGTDCNQT(415),GTDCNQTQ(416),TDCNQTQM(417),IVANEILF(476),VANEILFR(477),LPVATSDL(510),PVATSDLG(511),VATSDLGD(512),ATSDLGDD(513),TSDLGDDW(514),IMANIHPF(532),MANIHPFF(533),ISETGWPS(573),NGTNYFWF(618),GTNYFWFE(619),TNYFWFEA(620),NYFWFEAF(621),YFWFEAFD(622),FWFEAFDE(623),WFEAFDEP(624),FEAFDEPW(625),EAFDEPWK(626),WEDKWGLL(643),GVKIPDCG(659),VKIPDCGG(660),
```

The first line (denoted with the '>' prefix) contains the following:
- 'protein_name': the protein accession, taken straight from the description/identifier line of the input FASTA file of query protein sequences parsed by eCAMI
- 'fam_name:group_number' -- this is the `GH17:43` in the above example
    - the 'fam_name' is the CAZy family assigned to the CAZyme by eCAMI
    - the 'group_number' is the cluster number (which acts as an ID number for the kmer cluster) for the respective CAZy family
- 'subfam_name_of_the_group:subfam_name_count' -- in the above example this is `GH17:24|3.2.1.-:1`
    - this data is extracted from the kmer library
    - each 'label' (labels are separated by '|') is taken from the 'known labels' of the CAZymes used to create the kmer clusters. Specifically, to create the kmer clusters, eCAMI parses a FASTA file of all protein sequences to be used for the clustering. The protein data line (the line prefixed with '>') for each protein contains the protein name/accession and the 'known labels' of that protein. These known labels include CAZy family, CAZy subfamily and EC number annotations. When a kmer cluster is created, the known labels of the proteins included in the cluster are stored within the cluster as a `kmer_message` (this name is taken from eCAMI's `prediction.py` script). This `kmer_message` is then written out as the 'subfam_name_of_the_group:subfam_name_count' field in the output file.
    - the 'subfam_name_count' is the number of times the respective known label appears in the kmer cluster

The second line contains the kmers from the predicted CAZy family's kmer cluster that were found in the query protein sequence. The amino acid starting position of each kmer in the query protein sequence is given in brackets.
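The notebook itself does not parse the kmer line, but as a minimal sketch of how it could be read, the snippet below splits a shortened kmer line into `(kmer, position)` pairs. The names `KMER_PATTERN` and `parse_kmer_line` are hypothetical and are not part of eCAMI or `pyrewton`.

```python
import re

# Matches one entry such as "LHKVFPGM(369)": the k-mer letters followed by
# its starting position in brackets
KMER_PATTERN = re.compile(r"([A-Z]+)\((\d+)\)")


def parse_kmer_line(line):
    """Return a list of (kmer, start_position) tuples from an eCAMI k-mer line."""
    return [(kmer, int(position)) for kmer, position in KMER_PATTERN.findall(line)]


# Shortened version of the k-mer line from the example above
kmer_line = "LHKVFPGM(369),HKVFPGMD(370),KVFPGMDY(371),"
kmers = parse_kmer_line(kmer_line)
print(kmers)
```

Using `findall` rather than splitting on commas means the trailing comma at the end of each kmer line needs no special handling.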

To determine the CAZy family annotation predictions for each query protein sequence, eCAMI searches the query for the kmers of every CAZy family cluster in its kmer library. For each CAZy family, eCAMI counts the family's kmers that appear in the query protein sequence (the `score`) and sums the frequencies of those kmers. To be more explicit, during the generation of the kmer clusters a frequency score is calculated for each identified kmer, and this is stored alongside the kmer amino acid sequence in the kmer library. Then, for each CAZy family and for each of its characteristic kmers found in the query protein sequence, the sum of these kmer frequencies from the cluster library is calculated and assigned to a variable called `number_score`. If, for a given CAZy family, the `score` is greater than the `important_k_mer_number` (the minimum number of times kmers for a family need to appear in the query protein sequence) and the `number_score` is greater than `beta` (the minimum sum of the kmer frequencies, default 2), then the CAZy family and its associated score are stored in `sort_fam`, which is a list of lists, with each internal list representing one unique CAZy family. These CAZy families are then ranked by their `score`. The highest scoring CAZy family is stored as the `first_fam` and is written to the output file in the format displayed above.
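The filtering and ranking described above can be illustrated with a short sketch. This is not eCAMI's code: the function `rank_families` and the `family_hits` input structure are hypothetical, the frequency values are invented, and the strict `>` comparisons simply follow the wording of this notebook (eCAMI's own comparisons may be inclusive). The variable names `score`, `number_score`, `sort_fam` and `first_fam` and the default thresholds are those quoted in this notebook.

```python
def rank_families(family_hits, important_k_mer_number=5, beta=2.0):
    """Rank CAZy families by the number of characteristic k-mers found in a query.

    family_hits maps a CAZy family to the library frequency of each of its
    k-mers that was found in the query sequence (hypothetical input structure).
    """
    sort_fam = []
    for family, frequencies in family_hits.items():
        score = len(frequencies)         # number of the family's k-mers found in the query
        number_score = sum(frequencies)  # summed k-mer frequencies from the library
        # Strict comparisons follow the description in the text (an assumption)
        if score > important_k_mer_number and number_score > beta:
            sort_fam.append([family, score, number_score])
    # Highest-scoring family first
    sort_fam.sort(key=lambda entry: entry[1], reverse=True)
    return sort_fam


# Invented frequency values, for illustration only
family_hits = {
    "GH13": [0.5] * 12,
    "CBM48": [0.4] * 8,
    "GT4": [0.9] * 2,  # too few k-mers, so this family is filtered out
}
sort_fam = rank_families(family_hits)
first_fam = sort_fam[0][0]
print(first_fam, [entry[0] for entry in sort_fam])
```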

To find other predicted CAZyme domains (CAZy family annotations), eCAMI then retrieves the `kmer_message` for the `first_fam`. It iterates through the 'known labels' in the `kmer_message` and, for each CAZy family found in the 'known labels', checks whether that CAZy family is in `sort_fam`. If it is, the CAZy family is written to the output file for the query protein sequence in the same format as above, often with a `kmer_message` that is extremely similar to that of the first CAZy family.
  
<div class="alert-warning">
<p></p>
<b>Note:</b> CAZy families that are in the `kmer_message` but not in `sort_fam` are **not** written to the output file. Likewise, CAZy families that are in `sort_fam` but not in the `kmer_message` are **not** written to the output file.
<p></p>
</div>
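
The selection rule and the note above can be sketched as follows. This is an illustration of the described behaviour, not eCAMI's implementation: `FAMILY_PATTERN`, `families_to_report` and `sort_fam_names` are hypothetical names, and the set of families that passed the score thresholds (including 'GH31') is invented for the example.

```python
import re

# A plain CAZy family label such as GH13 or CBM48 (no subfamily suffix)
FAMILY_PATTERN = re.compile(r"[A-Z]{2,3}\d+")


def families_to_report(first_fam, kmer_message, sort_fam_names):
    """Return the families written out for a query protein.

    kmer_message is the '|'-separated label string of first_fam's cluster;
    sort_fam_names is the set of families that passed the score thresholds.
    """
    reported = [first_fam]
    for label in kmer_message.split("|"):
        name = label.split(":")[0]  # drop the label count
        if FAMILY_PATTERN.fullmatch(name) is None:
            continue  # subfamily or EC number, not a CAZy family
        if name in sort_fam_names and name not in reported:
            reported.append(name)
    return reported


kmer_message = "CBM48:339|GH13_8:324|2.4.1.18:32|GH13:4|GT4:2"
sort_fam_names = {"CBM48", "GH13", "GH31"}  # invented for illustration
reported = families_to_report("CBM48", kmer_message, sort_fam_names)
print(reported)
```

In this sketch 'GT4' is dropped because it is absent from `sort_fam_names`, and 'GH31' is dropped because it is absent from the `kmer_message`, which mirrors the two cases in the note.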

This is why the eCAMI output file can contain entries like the following (the lines of kmers have been removed):
```
>ATY67289.1 A9K55_000120     CBM48:20    CBM48:339|GH13_8:324|2.4.1.18:32|GH13:4|GT4:2

>ATY67289.1 A9K55_000120     GH13:203    GH13_8:577|CBM48:328|2.4.1.18:32|GH13:13|GT4:1
```
In this example, 'GT4' is listed in both `kmer_message`s, and the two `kmer_message`s share similar content, but the family 'GT4' is not listed in the 'fam_name:group_number' field for the protein in the output file. The actual CAZy family prediction from eCAMI is therefore the one written in the 'fam_name:group_number' field. This field always contains a CAZy family, and never a subfamily. However, to retrieve a potential subfamily prediction, we can check whether the `kmer_message` on the respective row contains a subfamily that is a child of the CAZy family listed in the 'fam_name:group_number' field. If it does, we can retrieve it as the CAZy subfamily prediction.

The EC numbers listed for each predicted CAZyme domain (unique CAZy family) are not predicted EC numbers. eCAMI can be used to predict EC numbers, but that is not of interest to this project. However, the EC numbers from the `kmer_message` will be retrieved and assigned to their respective CAZyme domain. Note that these associations are potentially weak and only indicative: they do not carry the same confidence as EC numbers predicted specifically by eCAMI. The EC numbers are retrieved chiefly because other researchers may want them, so the functionality is included now to save time in the future.


<div class="alert-warning">
<p></p>
<b>Note:</b> The eCAMI README states that `important_k_mer_number` defaults to 3. This is incorrect: in the script `eCAMI.prediction.py` the default value is 5.
<p></p>
</div>

<div class="alert-success">
<p></p>
<p>Therefore, all the desired data we want from eCAMI is in the lines prefixed with '>'.</p>
<p></p>
</div>

<a id="data_model"></a>

## Defining the data model

The design of the data model is explained in the notebook `prediction_data_model.ipynb`.


In [2]:
class CazymeDomain:
    """Single CAZyme domain in a protein, predicted by a CAZyme prediction tool.

    Each unique CAZy domain per protein is identifiable by a unique CAZy
    family-subfamily combination.

    Every CAZyme domain has a source CAZyme prediction tool that predicted the CAZyme
    domain, a parent CAZyme protein (represented by the protein accession), and CAZy
    family and subfamily combination. If no CAZy subfamily is predicted, the 
    subfamily will be listed as a null value.

    Hotpep, CUPP and eCAMI predict EC numbers for each CAZyme domain.
    HMMER and CUPP predict amino acid domain ranges.
    Multiple EC numbers and domain ranges can be predicted for a single CAZyme domain,
    therefore, these attributes are stored as lists.
    """

    def __init__(
        self,
        prediction_tool,
        protein_accession,
        cazy_family,
        cazy_subfamily=None,
        ec_numbers=None,
        domain_range=None,
    ):
        """Initiate instance
        
        :attr prediction_tool: str, CAZyme prediction tool which predicted the domain
        :attr protein_accession: str
        :attr cazy_family: str
        :attr cazy_subfamily: str
        :attr ec_numbers: list (list of str, each str contains a unique EC number)
        :attr domain_range: list (list of str, each str contains a unique domain range)
        """
        self.prediction_tool = prediction_tool
        self.protein_accession = protein_accession
        self.cazy_family = cazy_family
        
        # not all CAZyme domains are catalogued under a CAZy subfamily
        if cazy_subfamily is None:
            self.cazy_subfamily = np.nan
        else:
            self.cazy_subfamily = cazy_subfamily
        
        # EC numbers are not predicted for all CAZyme domains
        if ec_numbers is None:
            self.ec_numbers = []  # enables adding in EC numbers included in another line of the output file
        else:
            self.ec_numbers = ec_numbers
        
        # Not all prediction tools predict domain ranges
        if domain_range is None:
            self.domain_range = []  # enables adding domain range listed in another line of the output file
        else:
            self.domain_range = domain_range
    
    def __str__(self):
        return f"-CazymeDomain in {self.protein_accession}, fam={self.cazy_family}, subfam={self.cazy_subfamily}-"
    
    def __repr__(self):
        return f"<CazymeDomain parent={self.protein_accession} fam={self.cazy_family}, subfam={self.cazy_subfamily}>"

    
class CazymeProteinPrediction:
    """Single protein and CAZyme/non-CAZyme prediction by a CAZyme prediction tool"""
    
    def __init__(self, prediction_tool, protein_accession, cazyme_classification, cazyme_domains=None):
        """Initiate class instance.
        
        :attr prediction_tool: str, name of CAZyme prediction tool
        :attr protein_accession: str
        :attr cazyme_classification: int, 1=CAZyme, 0=non-CAZyme
        :attr cazyme_domains: list of CazymeDomain instances, domain predicted to be in the CAZyme
        """
        self.prediction_tool = prediction_tool
        self.protein_accession = protein_accession
        self.cazyme_classification = cazyme_classification  # CAZyme=1, non-CAZyme=0
        
        # non-CAZymes will have no cazyme_domains
        if cazyme_domains is None:
            self.cazyme_domains = []  # allows addition of CAZyme domains as they are parsed
        else:
            self.cazyme_domains = cazyme_domains
    
    def __str__(self):
        if self.cazyme_classification == 0:
            return f"-CazymeProteinPrediction, protein={self.protein_accession}, non-CAZyme-"
        else:
            return(
                f"-CazymeProteinPrediction, protein={self.protein_accession}, "
                f"CAZyme domains={len(self.cazyme_domains)}-"
            )
    
    def __repr__(self):
        return(
            f"<CazymeProteinPrediction, protein={self.protein_accession}, "
            f"cazyme_classification={self.cazyme_classification}>"
        )
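
Both classes default their list attributes to `None` and create a new list inside `__init__`. The sketch below illustrates why this pattern is preferable to a mutable default argument. The two classes `SharedDefault` and `FreshDefault` are hypothetical and exist only for this demonstration.

```python
class SharedDefault:
    def __init__(self, ec_numbers=[]):  # the default list is created once and shared
        self.ec_numbers = ec_numbers


class FreshDefault:
    def __init__(self, ec_numbers=None):
        # a new list per instance, the pattern used by the data model above
        self.ec_numbers = [] if ec_numbers is None else ec_numbers


a, b = SharedDefault(), SharedDefault()
a.ec_numbers.append("3.2.1.-")
shared_leak = b.ec_numbers  # b also sees the EC number added to a

c, d = FreshDefault(), FreshDefault()
c.ec_numbers.append("3.2.1.-")
fresh_ok = d.ec_numbers  # d is unaffected

print(shared_leak, fresh_ok)
```

With the `None` default, EC numbers appended to one CAZyme domain cannot leak into another domain that was also created without EC numbers.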


<a id="explore_one"></a>

## Exploring the first line of protein data

First, let's check how the data is separated and presented when using Python to read the text file. Splitting up the data will prepare it for adding to the data model. Additionally, we create a dictionary, keyed by protein accession and valued by the corresponding `CazymeProteinPrediction` instance. This enables quickly checking whether a `CazymeProteinPrediction` instance has already been created for a protein.

In [3]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

for line in ecami_file[:4]:
    print(line)

>ATY67280.1 A9K55_000020 	GH17:43	GH17:24|3.2.1.-:1
LHKVFPGM(369),HKVFPGMD(370),KVFPGMDY(371),VFPGMDYT(372),FPGMDYTP(373),PPSQNNVT(392),PSQNNVTR(393),SQNNVTRD(394),QNNVTRDV(395),NNVTRDVA(396),NVTRDVAV(397),VTRDVAVL(398),TRDVAVLS(399),RDVAVLSQ(400),DVAVLSQL(401),AVLSQLTN(403),IRLYGTDC(412),RLYGTDCN(413),LYGTDCNQ(414),YGTDCNQT(415),GTDCNQTQ(416),TDCNQTQM(417),IVANEILF(476),VANEILFR(477),LPVATSDL(510),PVATSDLG(511),VATSDLGD(512),ATSDLGDD(513),TSDLGDDW(514),IMANIHPF(532),MANIHPFF(533),ISETGWPS(573),NGTNYFWF(618),GTNYFWFE(619),TNYFWFEA(620),NYFWFEAF(621),YFWFEAFD(622),FWFEAFDE(623),WFEAFDEP(624),FEAFDEPW(625),EAFDEPWK(626),WEDKWGLL(643),GVKIPDCG(659),VKIPDCGG(660),
>ATY67467.1 A9K55_000078 	GT90:15	GT90:8
RRGRHPPP(101),RGRHPPPG(102),GRHPPPGF(103),RHPPPGFD(104),AQDPCLQP(311),QDPCLQPH(312),DPCLQPHL(313),PCLQPHLR(314),PLFGGSKL(341),LPDVDGNS(503),DGNSYSAR(507),GNSYSARW(508),NSYSARWR(509),SMPLKATI(523),MPLKATIY(524),TPWVHFVP(540),PWVHFVPF(541),WVHFVPFD(542),LYVWRLLL(592),YVWRLLLE(593),


We are only interested in the first line of each pair, which is identifiable by its '>' prefix.

In [4]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

for line in ecami_file[:4]:
    if line.startswith('>'):
        print(line)

>ATY67280.1 A9K55_000020 	GH17:43	GH17:24|3.2.1.-:1
>ATY67467.1 A9K55_000078 	GT90:15	GT90:8


In [5]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:4]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    print(prediction_output)


['>ATY67280.1 A9K55_000020 ', 'GH17:43', 'GH17:24|3.2.1.-:1']
['>ATY67467.1 A9K55_000078 ', 'GT90:15', 'GT90:8']


<a id="cazy_fam"></a>

## Getting the predicted CAZy family for the CAZyme domain

The CAZy family of the CAZyme domain represented by each line starting with '>' is stored in the second element of the `prediction_output` list.

In [6]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:8]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve the CAZy family
    cazy_fam = prediction_output[1]
    print("cazy_fam=", cazy_fam)

cazy_fam= GH17:43
cazy_fam= GT90:15
cazy_fam= CBM48:20
cazy_fam= GH13:203


The CAZy family is stored in the format "fam_name:group_number", where the group_number is the number of the k-mer group, which is not needed for evaluating the performance of eCAMI. We can remove the k-mer group number by splitting the "fam_name:group_number" string at the colon and retrieving the first item.

In [7]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:8]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve the CAZy family
    cazy_fam = prediction_output[1].split(":")[0]
    print("cazy_fam=", cazy_fam)

cazy_fam= GH17
cazy_fam= GT90
cazy_fam= CBM48
cazy_fam= GH13


<a id="cazy_subfam"></a>

## Getting the predicted CAZy subfamily

**Note:** For details about where the CAZy subfamily is retrieved from, please refer back to the section [Exploring the eCAMI output text file](#text_file).

The CAZy subfamily is stored in the third element of the first protein line (when the line is split by tabs), which is referred to as the 'subfam_name_of_the_group:subfam_name_count' in the eCAMI documentation. This string is composed of the known labels associated with the proteins used to create the kmer cluster library, which was then used to predict the CAZy families of our query protein sequences.

Therefore, in the line  
`>ATY67289.1 A9K55_000120     GH13:203    GH13_8:577|CBM48:328|2.4.1.18:32|GH13:13|GT4:1`, the string `GH13_8:577|CBM48:328|2.4.1.18:32|GH13:13|GT4:1`   
contains the CAZy subfamily of the CAZyme domain represented by the line.

We need to check whether the third element of the `prediction_output` list contains any subfamilies (identified by the general format `\D{2,3}\d+_\d+`) and whether the parent family matches the family listed in the second element of the `prediction_output` list.


In [8]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:8]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve the CAZy family
    cazy_fam = prediction_output[1].split(":")[0]
    
    # retrieve the CAZy subfamily
    subfam_group = prediction_output[2]
    print(subfam_group)

GH17:24|3.2.1.-:1
GT90:8
CBM48:339|GH13_8:324|2.4.1.18:32|GH13:4|GT4:2
GH13_8:577|CBM48:328|2.4.1.18:32|GH13:13|GT4:1


The items in the "subfam_name_of_the_group:subfam_name_count" section are separated by '|'. Therefore, we need to separate the items and then check which are CAZy subfamilies.

In [9]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:8]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve the CAZy family
    cazy_fam = prediction_output[1].split(":")[0]
    
    # retrieve the CAZy subfamily
    subfam_group = prediction_output[2]
    subfam_group = subfam_group.split("|")
    print(subfam_group)

['GH17:24', '3.2.1.-:1']
['GT90:8']
['CBM48:339', 'GH13_8:324', '2.4.1.18:32', 'GH13:4', 'GT4:2']
['GH13_8:577', 'CBM48:328', '2.4.1.18:32', 'GH13:13', 'GT4:1']


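The EC numbers and CAZy (sub)families are each stored with an associated number, separated from the label by a colon. As described in the section [Exploring the eCAMI output text file](#text_file), this number is the 'subfam_name_count': the number of times the respective label appears in the kmer cluster. It is not needed for parsing, so it is removed by splitting each item at the colon.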
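
As a minimal sketch of separating each label from the number after its colon, the hypothetical helper `parse_kmer_message` below builds a dictionary from a `kmer_message` string. It assumes every item has the `label:number` form seen in the examples in this notebook. The notebook's own parsing discards the numbers, so keeping them here is purely illustrative.

```python
def parse_kmer_message(kmer_message):
    """Map each label in a kmer_message string to the number after its colon."""
    label_counts = {}
    for item in kmer_message.split("|"):
        # rpartition splits at the last colon, so the label itself is kept whole
        label, _, count = item.rpartition(":")
        label_counts[label] = int(count)
    return label_counts


label_counts = parse_kmer_message("GH13_8:577|CBM48:328|2.4.1.18:32|GH13:13|GT4:1")
print(label_counts)
```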

In [10]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:8]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve the CAZy family
    cazy_family = prediction_output[1].split(":")[0]
    
    # retrieve the CAZy subfamily
    subfam_group = prediction_output[2]
    subfam_group = subfam_group.split("|")
    cazy_subfamilies = []  # store all listed predicted CAZy subfamilies
    # check if the subfam_group contains a predicted CAZy subfamily
    for item in subfam_group:
        # remove the group number of the predicted item
        item = item.split(":")[0]
        try:
            re.match(r"\D{2,3}\d+_\d+", item).group()  # .group() raises AttributeError when there is no match
            # check if the subfamily belongs to the CAZy family listed for the current CAZyme domain
            if item[:item.find("_")] == cazy_family:
                cazy_subfamilies.append(item)
        except AttributeError:  # raised if the item is not a CAZy subfamily
            pass
    
    print("cazy_family=", cazy_family)
    print("cazy_subfamilies=", cazy_subfamilies)
            

cazy_family= GH17
cazy_subfamilies= []
cazy_family= GT90
cazy_subfamilies= []
cazy_family= CBM48
cazy_subfamilies= []
cazy_family= GH13
cazy_subfamilies= ['GH13_8']
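
The same check can also be written without relying on an exception. The sketch below is an alternative formulation, not the code used by `pyrewton`: `SUBFAMILY_PATTERN` and `get_subfamilies` are hypothetical names, and `fullmatch` is used so that the whole label must have the subfamily format.

```python
import re

SUBFAMILY_PATTERN = re.compile(r"\D{2,3}\d+_\d+")


def get_subfamilies(labels, cazy_family):
    """Return the subfamily labels whose parent family is cazy_family."""
    subfamilies = []
    for label in labels:
        name = label.split(":")[0]  # drop the number after the colon
        if SUBFAMILY_PATTERN.fullmatch(name) is None:
            continue  # not a CAZy subfamily (a family or an EC number)
        if name.split("_")[0] == cazy_family:
            subfamilies.append(name)
    return subfamilies


labels = ["GH13_8:577", "CBM48:328", "2.4.1.18:32", "GH13:13", "GT4:1"]
gh13_subfamilies = get_subfamilies(labels, "GH13")    # subfamily of the listed family
cbm48_subfamilies = get_subfamilies(labels, "CBM48")  # GH13_8 is not a child of CBM48
print(gh13_subfamilies, cbm48_subfamilies)
```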


<a id="ecnum"></a>

## Getting the EC numbers

As noted and explained in the section [Exploring the eCAMI output text file](#text_file), these are not EC numbers specifically predicted by eCAMI. To predict the EC numbers of the query proteins, the analysis would need to be run again using the 'EC' kmer library. However, the EC numbers listed in the `kmer_message` (for this definition see the section 'Exploring the eCAMI output text file') are still retrieved and assigned to their respective CAZyme domain, as indicative annotations only.

The EC numbers for the current CAZyme domain are listed in the subfam_group (the third element of the `prediction_output` list), along with the CAZy subfamilies. Therefore, the EC numbers can be identified at the same time as the CAZy subfamilies. If digits are missing from an EC number, they are replaced with a dash, so the format of a string can be used to check whether it is an EC number.

In [11]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in ecami_file[:8]:
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve the CAZy family
    cazy_family = prediction_output[1].split(":")[0]
    
    # retrieve the CAZy subfamily and EC numbers
    subfam_group = prediction_output[2]
    subfam_group = subfam_group.split("|")

    cazy_subfamilies = []  # store all listed predicted CAZy subfamilies
    ec_numbers = []  # store all predicted EC numbers

    # check if the subfam_group contains a predicted CAZy subfamily
    for item in subfam_group:
        # remove the group number of the predicted item
        item = item.split(":")[0]

        # check if the item contains a CAZy subfamily
        try:
            re.match(r"\D{2,3}\d+_\d+", item).group()
            # check if the subfamily belongs to the CAZy family listed for the current CAZyme domain
            if item[:item.find("_")] == cazy_family:
                cazy_subfamilies.append(item)
        except AttributeError:  # raised if the item is not a CAZy subfamily
            pass
        
        # check if item contains an EC number
        try:
            re.match(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)", item).group()
            ec_numbers.append(item)
        except AttributeError:  # raised if the item is not an EC number
            pass
    
    print("cazy_family=", cazy_family)
    print("cazy_subfamilies=", cazy_subfamilies)
    print("ec_numbers=", ec_numbers)
            

cazy_family= GH17
cazy_subfamilies= []
ec_numbers= ['3.2.1.-']
cazy_family= GT90
cazy_subfamilies= []
ec_numbers= []
cazy_family= CBM48
cazy_subfamilies= []
ec_numbers= ['2.4.1.18']
cazy_family= GH13
cazy_subfamilies= ['GH13_8']
ec_numbers= ['2.4.1.18']


<a id="combining_proteins"></a>

## Combining data for a single protein

As discussed before, a single protein can be listed multiple times in the eCAMI output file. To combine the data into a single `CazymeProteinPrediction` instance per protein, we check whether a corresponding key (the protein's accession) is already in the `ecami_predictions` dictionary.

In [12]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in tqdm(ecami_file, desc="Parsing eCAMI output"):
    if not line.startswith('>'):  # only want the first line for a protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve protein accession, removing the '>' prefix and stripping white space
    protein_accession = prediction_output[0][1:].strip()
    cazyme_classification = 1
    
    # retrieve the CAZy family
    cazy_family = prediction_output[1].split(":")[0]
    
    # retrieve the CAZy subfamily and EC numbers
    subfam_group = prediction_output[2]
    subfam_group = subfam_group.split("|")

    cazy_subfamilies = []  # store all listed predicted CAZy subfamilies
    ec_numbers = []  # store all predicted EC numbers

    # check if the subfam_group contains a predicted CAZy subfamily
    for item in subfam_group:
        # remove the group number of the predicted item
        item = item.split(":")[0]

        # check if the item contains a CAZy subfamily
        try:
            re.match(r"\D{2,3}\d+_\d+", item).group()
            # check if the subfamily belongs to the CAZy family listed for the current CAZyme domain
            if item[:item.find("_")] == cazy_family:
                cazy_subfamilies.append(item)
            continue  # it was a CAZy subfamily, don't need to check if it was an EC#
    
        except AttributeError:  # raised if the item is not a CAZy subfamily
            pass
        
        # check if item contains an EC number
        try:
            re.match(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)", item).group()
            ec_numbers.append(item)
        except AttributeError:  # raised if the item is not an EC number
            # check if it was a CAZy family, this is for detecting unexpected data types
            try:
                re.match(r"\D{2,3}\d+", item).group()
            except AttributeError:
                print(
                    f"WARNING -- unexpected data type of item '{item}' for protein {protein_accession} "
                    "by eCAMI. Not adding data to parsed data from eCAMI."
                )
    
    # add CazymeProteinPrediction instance to ecami_predictions
    cazyme_classification = 1  # all proteins included in eCAMI output file are identified as CAZymes
    
    if len(cazy_subfamilies) == 0:
        cazy_subfamilies = [np.nan]
    else:
        cazy_subfamilies = ", ".join(cazy_subfamilies)  # convert to string
        
    if len(ec_numbers) == 0:
        ec_numbers = [np.nan]
    
    try:
        # add new predicted CAZyme domain to an existing CazymeProteinPrediction instance
        existing_prediction = ecami_predictions[protein_accession]
        print("existing_prediction=", existing_prediction)
        
        # check if the CAZyme domain has already been parsed
        existing_cazyme_domains = existing_prediction.cazyme_domains
        existance = False  # becomes true if existing instance found
        for domain in existing_cazyme_domains:
            if (domain.cazy_family == cazy_family) and (domain.cazy_subfamily == cazy_subfamilies):
                domain.ec_numbers.append(np.nan)
                existance = True
                print(f"WARNING -- existing cazyme domain found, {protein_accession}, {domain.cazy_family}, {domain.cazy_subfamily}")
                break
        
        if existance is False:
            # create new CAZyme domain
            new_cazyme_domain = CazymeDomain(
                "eCAMI",
                protein_accession,
                cazy_family,
                cazy_subfamilies,
                ec_numbers,
            )
            existing_prediction.cazyme_domains.append(new_cazyme_domain)
        print("existing_prediction=[2]", existing_prediction)
    
    except KeyError:  # raised if no corresponding CazymeProteinPrediction instance was found
        new_cazyme_domain = CazymeDomain(
            "eCAMI",
            protein_accession,
            cazy_family,
            cazy_subfamilies,
            ec_numbers,
        )

        new_protein = CazymeProteinPrediction(
            "eCAMI",
            protein_accession,
            cazyme_classification,
        )

        new_protein.cazyme_domains = [new_cazyme_domain]

        ecami_predictions[protein_accession] = new_protein



print("length=", len(list(ecami_predictions.values())))

HBox(children=(HTML(value='Parsing eCAMI output'), FloatProgress(value=0.0, max=644.0), HTML(value='')))

existing_prediction= -CazymeProteinPrediction, protein=ATY67289.1 A9K55_000120, CAZyme domains=1-
existing_prediction=[2] -CazymeProteinPrediction, protein=ATY67289.1 A9K55_000120, CAZyme domains=2-
existing_prediction= -CazymeProteinPrediction, protein=ATY66529.1 A9K55_000499, CAZyme domains=1-
existing_prediction=[2] -CazymeProteinPrediction, protein=ATY66529.1 A9K55_000499, CAZyme domains=2-
existing_prediction= -CazymeProteinPrediction, protein=ATY66664.1 A9K55_000997, CAZyme domains=1-
existing_prediction=[2] -CazymeProteinPrediction, protein=ATY66664.1 A9K55_000997, CAZyme domains=2-
existing_prediction= -CazymeProteinPrediction, protein=ATY66444.1 A9K55_001136, CAZyme domains=1-
existing_prediction=[2] -CazymeProteinPrediction, protein=ATY66444.1 A9K55_001136, CAZyme domains=2-
existing_prediction= -CazymeProteinPrediction, protein=ATY65637.1 A9K55_001791, CAZyme domains=1-
existing_prediction=[2] -CazymeProteinPrediction, protein=ATY65637.1 A9K55_001791, CAZyme domains=2-
exist

The printout shows that the parser detects when multiple CAZyme domains are predicted for the same protein. Remember that each CAZyme domain prediction spans 2 lines, so the 644 lines in the output file contain 322 predicted CAZyme domains, some of which belong to the same protein.
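The relationship between lines, domains and proteins can be illustrated on a small example. The sketch below uses a made-up eCAMI-style string (the accessions, families and k-mers are invented for illustration) and assumes only what is described above: each prediction is a tab-separated header line starting with `>`, followed by one line of k-mers.

```python
from collections import Counter

# Made-up eCAMI-style output: each prediction is a header line plus a k-mer line
sample_output = (
    ">protA\tGH5:12\tGH5_7:3|3.2.1.4:2\n"
    "KMERA,KMERB\n"
    ">protA\tCBM1:4\tCBM1:4\n"
    "KMERC\n"
    ">protB\tGT2:9\tGT2:9\n"
    "KMERD,KMERE\n"
)

lines = sample_output.splitlines()

# Only the header lines describe a predicted CAZyme domain
header_lines = [line for line in lines if line.startswith(">")]

# Count how many domains were predicted for each protein accession
domains_per_protein = Counter(
    line.split("\t")[0][1:].strip() for line in header_lines
)

print("lines =", len(lines))
print("domains =", len(header_lines))
print("proteins =", len(domains_per_protein))
```

Half of the lines are header lines, and the number of distinct proteins is smaller than the number of domains whenever a protein has more than one predicted domain.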

<a id="no_caz"></a>

## Adding non-CAZymes

The eCAMI output includes only the proteins that eCAMI has predicted as CAZymes. To evaluate how well the tool differentiates CAZymes from non-CAZymes, we need a classification for every input protein. Therefore, we parse the FASTA file containing the protein sequences used as input by eCAMI, and add every protein that is not included in the eCAMI output file as a non-CAZyme.
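The idea can be tried on a toy example before applying it to the real files. This is a minimal sketch, assuming a plain dictionary of classifications in place of `CazymeProteinPrediction` instances; the accessions and sequences are invented.

```python
# Accessions already parsed from the (hypothetical) eCAMI output; 1 = CAZyme
classifications = {"protA": 1, "protB": 1}

# Made-up FASTA content standing in for the eCAMI input file
fasta_text = ">protA\nMKT\n>protB\nMSA\n>protC\nMLL\n>protD\nMGG\n"

for line in fasta_text.splitlines():
    if line.startswith(">"):
        accession = line[1:].strip()  # remove '>' prefix and white space
        # setdefault only adds the accession if it is not already present,
        # so existing CAZyme predictions are left untouched
        classifications.setdefault(accession, 0)

print(classifications)
```

Proteins found only in the FASTA file are recorded with a classification of 0, while proteins already present keep their classification of 1.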

In [13]:
# Read the output file from eCAMI
with open("8080495_output.txt", "r") as fh:
    ecami_file = fh.read().splitlines()

ecami_predictions = {}  # stores proteins {protein_accession:CazymeProteinPrediction_instance}

for line in tqdm(ecami_file, desc="Parsing eCAMI output"):
    if not line.startswith('>'):  # only want the header line for each protein, not the listed k-mers
        continue
    
    prediction_output = line.split("\t")
    
    # retrieve protein accession, removing the '>' prefix and stripping white space
    protein_accession = prediction_output[0][1:].strip()
    cazyme_classification = 1
    
    # retrieve the CAZy family
    cazy_family = prediction_output[1].split(":")[0]
    
    # retrieve the CAZy subfamily and EC numbers
    subfam_group = prediction_output[2]
    subfam_group = subfam_group.split("|")

    cazy_subfamilies = []  # store all listed predicted CAZy subfamilies
    ec_numbers = []  # store all predicted EC numbers

    # check if the subfam_group contains a predicted CAZy subfamily
    for item in subfam_group:
        # remove the group number of the predicted item
        item = item.split(":")[0]

        # check if the item contains a CAZy subfamily
        try:
            re.match(r"\D{2,3}\d+_\d+", item).group()
            # check if the subfamily belongs to the CAZy family listed for the current CAZyme domain
            if item[:item.find("_")] == cazy_family:
                cazy_subfamilies.append(item)
            continue  # it was a CAZy subfamily, don't need to check if it was an EC#
    
        except AttributeError:  # raised if the item is not a CAZy subfamily
            pass
        
        # check if item contains an EC number
        try:
            re.match(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)", item).group()
            ec_numbers.append(item)
        except AttributeError:  # raised if the item is not an EC number
            # check if it was a CAZy family, this is for detecting unexpected data types
            try:
                re.match(r"\D{2,3}\d+", item).group()
            except AttributeError:
                print(
                    f"WARNING -- unexpected data type of item '{item}' for protein {protein_accession} "
                    "by eCAMI. Not adding data to parsed data from eCAMI."
                )
    
    # add ECAMIprediction instance to ecami_prediction
    cazyme_classification = 1  # all proteins included in eCAMI output file are identified as CAZymes
    
    if len(cazy_subfamilies) == 0:
        cazy_subfamilies = [np.nan]
    else:
        cazy_subfamilies = ", ".join(cazy_subfamilies)  # convert to string
        
    if len(ec_numbers) == 0:
        ec_numbers = [np.nan]
    
    try:
        # add new predicted CAZyme domain to an existing ECAMIprediction instance
        existing_prediction = ecami_predictions[protein_accession]
        
        # check if the CAZyme domain has already been passed
        existing_cazyme_domains = existing_prediction.cazyme_domains
        existance = False  # becomes true if existing instance found
        for domain in existing_cazyme_domains:
            if (domain.cazy_family == cazy_family) and (domain.cazy_subfamily == cazy_subfamilies):
                domain.ec_numbers.append(np.nan)
                existance = True
                print(f"WARNING -- existing cazyme domain found, {protein_accession}, {domain.cazy_family}, {domain.cazy_subfamily}")
                break
        
        if existance is False:
            # create new CAZyme domain
            new_cazyme_domain = CazymeDomain(
                "eCAMI",
                protein_accession,
                cazy_family,
                cazy_subfamilies,
                ec_numbers,
            )
            existing_prediction.cazyme_domains.append(new_cazyme_domain)
    
    except KeyError:  # raised if no corresponding ECAMIprediction instance was found

            new_cazyme_domain = CazymeDomain(
                "eCAMI",
                protein_accession,
                cazy_family,
                cazy_subfamilies,
                ec_numbers,
            )
            
            new_protein = CazymeProteinPrediction(
                "eCAMI",
                protein_accession,
                cazyme_classification,
            )
    
            new_protein.cazyme_domains = [new_cazyme_domain]
            
            ecami_predictions[protein_accession] = new_protein

print("length pre-non-CAZymes =", len(list(ecami_predictions.values())))
            
# Add non-CAZymes
# open the FASTA file containing the input protein sequences
with open("genbank_proteins_txid73501_GCA_008080495_1.fasta", "r") as fh:
    fasta = fh.read().splitlines()
count = 0
for line in tqdm(fasta, desc="Adding non-CAZymes"):
    if line.startswith(">"):
        count += 1
        # remove '>' prefix and white space
        protein_accession = line[1:].strip()
        
        # check if the protein is already listed in the eCAMI predictions
        try:
            ecami_predictions[protein_accession]
        except KeyError:  # raised if protein not in ecami_predictions
            cazyme_classification = 0
            ecami_predictions[protein_accession] = CazymeProteinPrediction(
                "eCAMI",
                protein_accession,
                cazyme_classification,
            )

print("proteins parsed =", len(list(ecami_predictions.values())))
print("proteins in the input FASTA file=", count)



length pre-non-CAZymes = 289


HBox(children=(HTML(value='Adding non-CAZymes'), FloatProgress(value=0.0, max=93219.0), HTML(value='')))


proteins parsed = 9287
proteins in the input FASTA file= 9287


From counting the proteins in the FASTA file we can see that all proteins are retrieved, and every protein is represented by a `CazymeProteinPrediction` instance.
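Comparing two counts confirms that the totals agree, but not that the same proteins are present or that the classifications are consistent with the stored domains. A stricter check could look like the sketch below. The `Prediction` dataclass is a hypothetical stand-in for `CazymeProteinPrediction`, and `check_predictions` is an illustrative helper rather than part of `pyrewton`.

```python
from dataclasses import dataclass, field


@dataclass
class Prediction:
    """Hypothetical stand-in for CazymeProteinPrediction."""
    protein_accession: str
    cazyme_classification: int
    cazyme_domains: list = field(default_factory=list)


def check_predictions(predictions, fasta_accessions):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # every FASTA protein must be parsed, and nothing else
    missing = set(fasta_accessions) - set(predictions)
    extra = set(predictions) - set(fasta_accessions)
    if missing:
        problems.append(f"missing from predictions: {sorted(missing)}")
    if extra:
        problems.append(f"not in FASTA file: {sorted(extra)}")

    # a protein should be classified as a CAZyme if and only if it has domains
    for accession, prediction in predictions.items():
        is_cazyme = prediction.cazyme_classification == 1
        has_domains = len(prediction.cazyme_domains) > 0
        if is_cazyme != has_domains:
            problems.append(f"{accession}: classification does not match domains")

    return problems


good = {
    "protA": Prediction("protA", 1, ["GH5"]),
    "protB": Prediction("protB", 0),
}
bad = {"protA": Prediction("protA", 1)}  # CAZyme without any domain

print(check_predictions(good, ["protA", "protB"]))
print(check_predictions(bad, ["protA"]))
```

Returning a list of problems, rather than raising on the first one, makes it possible to see every inconsistency in a single pass.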

<a id="final"></a>

## Final functions

We can see above that all the proteins contained in the FASTA file, which was used as input by eCAMI, are parsed by the method above. Now we can break this up into functions that are easier to follow and suitable for dropping straight into `pyrewton`.

In [14]:
def parse_ecami_output(ecami_output_path, fasta_path):
    """Parse the output from eCAMI, retrieving predicted CAZyme domains, CAZy (sub)families and EC numbers.
    
    :param ecami_output_path: Path, output text file from eCAMI
    :param fasta_path: Path, FASTA file used as input by eCAMI
    
    Return a list of CazymeProteinPrediction instances, one per protein in the FASTA file used as eCAMI input.
    """
    # Read the output file from eCAMI
    with open(ecami_output_path, "r") as fh:
        ecami_file = fh.read().splitlines()

    ecami_predictions = {}  # stores proteins {protein_accession:ECAMIprediction_instance}

    for line in tqdm(ecami_file, desc="Parsing eCAMI output"):
        if not line.startswith('>'):  # only want the header line for each protein, not the listed k-mers
            continue

        prediction_output = line.split("\t")

        # retrieve protein accession, removing the '>' prefix and stripping white space
        protein_accession = prediction_output[0][1:].strip()

        # retrieve the CAZy family
        cazy_family = prediction_output[1].split(":")[0].strip()

        # retrieve the child CAZy subfamilies of the CAZy family and the EC number annotations
        cazy_subfamilies, ec_numbers = get_subfamily_ec_numbers(prediction_output[2], cazy_family)

        # add ECAMIprediction instance to ecami_prediction
        cazyme_classification = 1  # all proteins included in eCAMI output file are identified as CAZymes

        try:
            # add new predicted CAZyme domain to an existing ECAMIprediction instance
            existing_prediction = ecami_predictions[protein_accession]

            # check if the CAZyme domain has already been parsed
            existing_cazyme_domains = existing_prediction.cazyme_domains
            existance = False  # becomes True if a matching domain is found
            for domain in existing_cazyme_domains:
                if (domain.cazy_family == cazy_family) and (domain.cazy_subfamily == cazy_subfamilies):
                    # Domain has been parsed previously, check if additional EC numbers were retrieved
                    for ec in ec_numbers:
                        if ec not in domain.ec_numbers:
                            domain.ec_numbers.append(ec)
                    existance = True
                    break  # a matching domain was found, no need to check the rest

            if existance is False:
                # create new CAZyme domain
                new_cazyme_domain = CazymeDomain(
                    prediction_tool="eCAMI",
                    protein_accession=protein_accession,
                    cazy_family=cazy_family,
                    cazy_subfamily=cazy_subfamilies,
                    ec_numbers=ec_numbers,
                )
                existing_prediction.cazyme_domains.append(new_cazyme_domain)

        except KeyError:  # raised if no corresponding ECAMIprediction instance was found
                new_cazyme_domain = CazymeDomain(
                    prediction_tool="eCAMI",
                    protein_accession=protein_accession,
                    cazy_family=cazy_family,
                    cazy_subfamily=cazy_subfamilies,
                    ec_numbers=ec_numbers,
                )

                new_protein = CazymeProteinPrediction(
                    prediction_tool="eCAMI",
                    protein_accession=protein_accession,
                    cazyme_classification=cazyme_classification,
                )

                new_protein.cazyme_domains.append(new_cazyme_domain)
                
                ecami_predictions[protein_accession] = new_protein

    ecami_predictions = add_non_cazymes(ecami_predictions, fasta_path)
    
    return list(ecami_predictions.values())

                
def get_subfamily_ec_numbers(subfam_group, cazy_family):
    """Retrieve the predicted CAZy subfamily and associated EC numbers.
    
    Retrieves only the child CAZy subfamilies of the CAZy family of the current working
    CAZyme domain in the protein. Returns the CAZy subfamilies as a string, and the EC numbers
    as a list of strings, with each string containing a unique EC number. The 'subfam_group'
    refers to the 'subfam_name_of_the_group:subfam_name_count' in the eCAMI output file.
    
    :param subfam_group: string, the 'subfam_name_of_the_group:subfam_name_count'
    :param cazy_family: string, CAZy family for the current working CAZyme domain
    
    Return the CAZy subfamily for the CAZyme domain (str) and list of associated EC numbers.
    """
    cazy_subfamilies = []  # store all listed predicted CAZy subfamilies
    ec_numbers = []  # store all predicted EC numbers

    # individual items are separated by "|"
    subfam_group = subfam_group.split("|")

    # check if the subfam_group contains a predicted CAZy subfamily
    for item in subfam_group:
        # remove the group number of the predicted item
        item = item.split(":")[0]

        # check if the item contains a CAZy subfamily
        try:
            re.match(r"\D{2,3}\d+_\d+", item).group()
            # check if the subfamily belongs to the CAZy family listed for the current CAZyme domain
            if item[:item.find("_")] == cazy_family:
                cazy_subfamilies.append(item)

        except AttributeError:  # raised if the item is not a CAZy subfamily
            pass

        # check if item contains an EC number
        try:
            re.match(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)", item).group()
            ec_numbers.append(item)

        except AttributeError:  # raised if the item is not an EC number
            pass

    cazy_subfamilies = ", ".join(cazy_subfamilies)  # convert to string

    return cazy_subfamilies, ec_numbers


def add_non_cazymes(ecami_predictions, fasta_path):
    """Add non-CAZymes to the parsed eCAMI output.
    
    The eCAMI output only includes the proteins it predicts to be CAZymes. Therefore, this function
    goes through the FASTA file that was used as input by eCAMI and adds the proteins not
    classified as CAZymes to the parsed proteins.
    
    :param ecami_predictions: dict, keyed by protein accession and valued by CazymeProteinPrediction instance
    :param fasta_path: Path, FASTA file used as input by eCAMI
    
    Return a dictionary keyed by protein accession and valued by the corresponding CazymeProteinPrediction instance.
    """
    # Add non-CAZymes
    # open the FASTA file containing the input protein sequences
    with open(fasta_path, "r") as fh:
        fasta = fh.read().splitlines()

    for line in tqdm(fasta, desc="Adding non-CAZymes"):
        if line.startswith(">"):

            # remove '>' prefix and white space
            protein_accession = line[1:].strip()

            # check if the protein is already listed in the eCAMI predictions
            try:
                ecami_predictions[protein_accession]

            except KeyError:  # raised if protein not in ecami_predictions
                cazyme_classification = 0
                ecami_predictions[protein_accession] = CazymeProteinPrediction(
                    "eCAMI",
                    protein_accession,
                    cazyme_classification,
                )
    
    return ecami_predictions

results = parse_ecami_output("8080495_output.txt", "genbank_proteins_txid73501_GCA_008080495_1.fasta")
print("proteins parsed=", len(results))
print("proteins in the input FASTA file=", 9287)
print("output=", results[0])


proteins parsed= 9287
proteins in the input FASTA file= 9287
output= -CazymeProteinPrediction, protein=ATY67280.1 A9K55_000020, CAZyme domains=1-
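The handling of the `subfam_group` field inside `get_subfamily_ec_numbers` can be tried in isolation. The sketch below is a compact variation, not the function above: it reuses the same two regular expressions but applies them with `fullmatch` instead of `match` inside `try`/`except`, and the field value and the name `split_subfam_group` are invented for illustration.

```python
import re

# Same patterns as in get_subfamily_ec_numbers
SUBFAMILY = re.compile(r"\D{2,3}\d+_\d+")
EC_NUMBER = re.compile(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)")


def split_subfam_group(subfam_group, cazy_family):
    """Return (subfamilies as a string, list of EC numbers) for one field."""
    subfamilies, ec_numbers = [], []
    for item in subfam_group.split("|"):
        item = item.split(":")[0]  # drop the count that follows the colon
        if SUBFAMILY.fullmatch(item):
            # keep only subfamilies that are children of the domain's family
            if item.split("_")[0] == cazy_family:
                subfamilies.append(item)
        elif EC_NUMBER.fullmatch(item):
            ec_numbers.append(item)
        # anything else (e.g. a bare family such as "GH5") is ignored
    return ", ".join(subfamilies), ec_numbers


# Made-up field: two subfamilies, two EC numbers and one bare family
sample_field = "GH5_7:3|GH13_1:1|3.2.1.4:2|3.2.1.-:1|GH5:6"
print(split_subfam_group(sample_field, "GH5"))
```

Here `GH13_1` is dropped because it is not a child of `GH5`, and the bare family `GH5` matches neither pattern.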


From before we know there are 9,287 proteins in the FASTA file that was used as input by eCAMI, so we can see that all proteins are now stored in the list `results`. We can iterate through the results to make sure that we retained the CAZyme domain predictions.

In [15]:
for result in results:
    print(f"{result.protein_accession}, classification={result.cazyme_classification}, domains={len(result.cazyme_domains)}")

ATY67280.1 A9K55_000020, classification=1, domains=1
ATY67467.1 A9K55_000078, classification=1, domains=1
ATY67289.1 A9K55_000120, classification=1, domains=2
ATY67196.1 A9K55_000147, classification=1, domains=1
ATY67226.1 A9K55_000162, classification=1, domains=1
ATY67472.1 A9K55_000170, classification=1, domains=1
ATY67327.1 A9K55_000335, classification=1, domains=1
ATY67094.1 A9K55_000365, classification=1, domains=1
ATY67306.1 A9K55_000368, classification=1, domains=1
ATY67182.1 A9K55_000377, classification=1, domains=1
ATY67158.1 A9K55_000434, classification=1, domains=1
ATY67425.1 A9K55_000448, classification=1, domains=1
ATY67138.1 A9K55_000458, classification=1, domains=1
ATY66529.1 A9K55_000499, classification=1, domains=2
ATY66624.1 A9K55_000551, classification=1, domains=1
ATY67020.1 A9K55_000635, classification=1, domains=1
ATY66542.1 A9K55_000689, classification=1, domains=1
ATY66876.1 A9K55_000707, classification=1, domains=1
ATY66875.1 A9K55_000708, classification=1, dom

ATY66229.1 A9K55_001601, classification=0, domains=0
ATY66228.1 A9K55_001602, classification=0, domains=0
ATY66227.1 A9K55_001603, classification=0, domains=0
ATY66234.1 A9K55_001604, classification=0, domains=0
ATY66233.1 A9K55_001605, classification=0, domains=0
ATY66232.1 A9K55_001606, classification=0, domains=0
ATY66231.1 A9K55_001607, classification=0, domains=0
ATY66236.1 A9K55_001608, classification=0, domains=0
ATY66235.1 A9K55_001609, classification=0, domains=0
ATY65693.1 A9K55_001610, classification=0, domains=0
ATY65694.1 A9K55_001611, classification=0, domains=0
ATY65695.1 A9K55_001612, classification=0, domains=0
ATY65696.1 A9K55_001614, classification=0, domains=0
ATY65697.1 A9K55_001615, classification=0, domains=0
ATY65698.1 A9K55_001616, classification=0, domains=0
ATY65699.1 A9K55_001617, classification=0, domains=0
ATY65700.1 A9K55_001618, classification=0, domains=0
ATY65701.1 A9K55_001619, classification=0, domains=0
ATY66270.1 A9K55_001620, classification=0, dom

ATY59226.1 A9K55_002537, classification=0, domains=0
ATY59227.1 A9K55_002538, classification=0, domains=0
ATY59228.1 A9K55_002539, classification=0, domains=0
ATY58401.1 A9K55_002540, classification=0, domains=0
ATY58400.1 A9K55_002541, classification=0, domains=0
ATY59037.1 A9K55_002542, classification=0, domains=0
ATY59036.1 A9K55_002543, classification=0, domains=0
ATY58399.1 A9K55_002544, classification=0, domains=0
ATY58398.1 A9K55_002545, classification=0, domains=0
ATY59034.1 A9K55_002546, classification=0, domains=0
ATY59033.1 A9K55_002547, classification=0, domains=0
ATY58403.1 A9K55_002548, classification=0, domains=0
ATY58402.1 A9K55_002549, classification=0, domains=0
ATY59148.1 A9K55_002551, classification=0, domains=0
ATY59146.1 A9K55_002552, classification=0, domains=0
ATY59147.1 A9K55_002553, classification=0, domains=0
ATY59151.1 A9K55_002554, classification=0, domains=0
ATY59152.1 A9K55_002555, classification=0, domains=0
ATY59149.1 A9K55_002556, classification=0, dom

ATY59503.1 A9K55_003716, classification=0, domains=0
ATY59502.1 A9K55_003717, classification=0, domains=0
ATY59507.1 A9K55_003718, classification=0, domains=0
ATY59506.1 A9K55_003719, classification=0, domains=0
ATY59022.1 A9K55_003720, classification=0, domains=0
ATY58550.1 A9K55_003721, classification=0, domains=0
ATY58549.1 A9K55_003722, classification=0, domains=0
ATY58808.1 A9K55_003723, classification=0, domains=0
ATY58806.1 A9K55_003724, classification=0, domains=0
ATY58807.1 A9K55_003725, classification=0, domains=0
ATY58804.1 A9K55_003726, classification=0, domains=0
ATY58805.1 A9K55_003727, classification=0, domains=0
ATY58802.1 A9K55_003728, classification=0, domains=0
ATY59592.1 A9K55_003730, classification=0, domains=0
ATY59591.1 A9K55_003731, classification=0, domains=0
ATY59594.1 A9K55_003732, classification=0, domains=0
ATY59593.1 A9K55_003733, classification=0, domains=0
ATY59596.1 A9K55_003734, classification=0, domains=0
ATY59595.1 A9K55_003735, classification=0, dom

ATY64451.1 A9K55_004437, classification=0, domains=0
ATY64462.1 A9K55_004438, classification=0, domains=0
ATY63980.1 A9K55_004440, classification=0, domains=0
ATY63981.1 A9K55_004441, classification=0, domains=0
ATY63978.1 A9K55_004442, classification=0, domains=0
ATY63979.1 A9K55_004443, classification=0, domains=0
ATY63976.1 A9K55_004444, classification=0, domains=0
ATY63977.1 A9K55_004445, classification=0, domains=0
ATY63974.1 A9K55_004446, classification=0, domains=0
ATY63975.1 A9K55_004447, classification=0, domains=0
ATY63983.1 A9K55_004449, classification=0, domains=0
ATY64632.1 A9K55_004450, classification=0, domains=0
ATY64631.1 A9K55_004451, classification=0, domains=0
ATY64634.1 A9K55_004452, classification=0, domains=0
ATY64633.1 A9K55_004453, classification=0, domains=0
ATY64636.1 A9K55_004454, classification=0, domains=0
ATY64635.1 A9K55_004455, classification=0, domains=0
ATY64638.1 A9K55_004456, classification=0, domains=0
ATY64637.1 A9K55_004457, classification=0, dom

ATY60359.1 A9K55_005929, classification=0, domains=0
ATY60373.1 A9K55_005931, classification=0, domains=0
ATY60091.1 A9K55_005932, classification=0, domains=0
ATY60405.1 A9K55_005933, classification=0, domains=0
ATY60411.1 A9K55_005936, classification=0, domains=0
ATY60096.1 A9K55_005937, classification=0, domains=0
ATY60413.1 A9K55_005938, classification=0, domains=0
ATY60412.1 A9K55_005939, classification=0, domains=0
ATY61171.1 A9K55_005940, classification=0, domains=0
ATY61170.1 A9K55_005941, classification=0, domains=0
ATY60158.1 A9K55_005942, classification=0, domains=0
ATY60159.1 A9K55_005943, classification=0, domains=0
ATY60160.1 A9K55_005944, classification=0, domains=0
ATY60161.1 A9K55_005945, classification=0, domains=0
ATY60154.1 A9K55_005946, classification=0, domains=0
ATY60155.1 A9K55_005947, classification=0, domains=0
ATY60780.1 A9K55_005948, classification=0, domains=0
ATY60157.1 A9K55_005949, classification=0, domains=0
ATY60166.1 A9K55_005950, classification=0, dom

ATY60263.1 A9K55_006697, classification=0, domains=0
ATY60264.1 A9K55_006698, classification=0, domains=0
ATY60508.1 A9K55_006699, classification=0, domains=0
ATY60266.1 A9K55_006700, classification=0, domains=0
ATY60265.1 A9K55_006701, classification=0, domains=0
ATY60292.1 A9K55_006702, classification=0, domains=0
ATY61148.1 A9K55_006703, classification=0, domains=0
ATY61146.1 A9K55_006705, classification=0, domains=0
ATY61145.1 A9K55_006706, classification=0, domains=0
ATY60291.1 A9K55_006707, classification=0, domains=0
ATY61144.1 A9K55_006708, classification=0, domains=0
ATY61143.1 A9K55_006709, classification=0, domains=0
ATY60295.1 A9K55_006710, classification=0, domains=0
ATY61149.1 A9K55_006711, classification=0, domains=0
ATY60357.1 A9K55_006713, classification=0, domains=0
ATY60086.1 A9K55_006714, classification=0, domains=0
ATY60087.1 A9K55_006715, classification=0, domains=0
ATY60353.1 A9K55_006716, classification=0, domains=0
ATY60354.1 A9K55_006717, classification=0, dom

ATY63335.1 A9K55_008203, classification=0, domains=0
ATY63126.1 A9K55_008204, classification=0, domains=0
ATY63127.1 A9K55_008205, classification=0, domains=0
ATY63128.1 A9K55_008206, classification=0, domains=0
ATY63129.1 A9K55_008207, classification=0, domains=0
ATY63130.1 A9K55_008208, classification=0, domains=0
ATY62667.1 A9K55_008209, classification=0, domains=0
ATY63123.1 A9K55_008210, classification=0, domains=0
ATY63124.1 A9K55_008211, classification=0, domains=0
ATY62002.1 A9K55_008212, classification=0, domains=0
ATY62001.1 A9K55_008213, classification=0, domains=0
ATY62000.1 A9K55_008214, classification=0, domains=0
ATY61999.1 A9K55_008215, classification=0, domains=0
ATY62006.1 A9K55_008216, classification=0, domains=0
ATY62005.1 A9K55_008217, classification=0, domains=0
ATY62004.1 A9K55_008218, classification=0, domains=0
ATY62003.1 A9K55_008219, classification=0, domains=0
ATY61998.1 A9K55_008220, classification=0, domains=0
ATY61997.1 A9K55_008221, classification=0, dom

ATY63372.1 A9K55_008687, classification=0, domains=0
ATY63373.1 A9K55_008688, classification=0, domains=0
ATY63374.1 A9K55_008689, classification=0, domains=0
ATY63673.1 A9K55_008690, classification=0, domains=0
ATY63674.1 A9K55_008691, classification=0, domains=0
ATY62582.1 A9K55_008692, classification=0, domains=0
ATY62581.1 A9K55_008693, classification=0, domains=0
ATY62580.1 A9K55_008694, classification=0, domains=0
ATY62579.1 A9K55_008695, classification=0, domains=0
ATY62577.1 A9K55_008697, classification=0, domains=0
ATY62576.1 A9K55_008698, classification=0, domains=0
ATY62575.1 A9K55_008699, classification=0, domains=0
ATY62585.1 A9K55_008700, classification=0, domains=0
ATY62584.1 A9K55_008701, classification=0, domains=0
ATY62319.1 A9K55_008702, classification=0, domains=0
ATY61996.1 A9K55_008703, classification=0, domains=0
ATY61469.1 A9K55_008704, classification=0, domains=0
ATY61468.1 A9K55_008705, classification=0, domains=0
ATY61465.1 A9K55_008706, classification=0, dom