# EsMeCaTa method tutorial

The idea of this tutorial is to present the method behind EsMeCaTa. It is not a tutorial on how to use it (it will not show the commands) but rather a walk through the functions behind the workflow and what they do.

To show how it works, EsMeCaTa will be applied to the [Buchnera example file](https://github.com/AuReMe/esmecata/blob/master/test/buchnera_workflow.tsv), which contains this taxonomic affiliation:

`'cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Buchnera;Buchnera aphidicola'`

## EsMeCaTa proteomes

### From taxonomic affiliations to taxon ID

First, EsMeCaTa iterates through the taxonomic affiliations to find the taxon ID (in the NCBI Taxonomy database, using ete3) for each taxon name. These taxon IDs will then be used to search for proteomes in the [UniProt Proteomes database](https://www.uniprot.org/proteomes/).

In [2]:
from esmecata.proteomes import taxonomic_affiliation_to_taxon_id
from ete3 import NCBITaxa
ncbi = NCBITaxa()

taxonomic_affiliation = 'cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Buchnera;Buchnera aphidicola'

tax_ids_to_names, taxon_ids = taxonomic_affiliation_to_taxon_id('Buchnera', taxonomic_affiliation, ncbi)

tax_ids_to_names

{2: 'Bacteria',
 32199: 'Buchnera',
 46073: 'Buchnera',
 9: 'Buchnera aphidicola',
 131567: 'cellular organisms',
 91347: 'Enterobacterales',
 1903409: 'Erwiniaceae',
 1236: 'Gammaproteobacteria',
 1224: 'Proteobacteria'}

For each taxon in the taxonomic affiliations, we have a taxon ID. If no corresponding taxon ID is found for a taxon name, it will be shown as 'not found'.

But there is a risk that some taxon names are associated with multiple taxon IDs (here the name `Buchnera` matches both 32199 and 46073), so we have to disambiguate the taxon IDs found.

### Disambiguate taxon IDs found

A taxon name can be associated with multiple taxon IDs.

For example, we will use the taxonomic affiliation `Bacteria;Gammaproteobacteria;Yersinia`. The taxon name Yersinia is associated with 2 taxon IDs (one mantid (taxId: 444888) and one bacterium (taxId: 629)). From the name alone, EsMeCaTa is not able to differentiate them.

EsMeCaTa will compare all the other taxa of the taxonomic affiliation (here: 2 (Bacteria) and 1236 (Gammaproteobacteria)) to the lineage associated with each of the two taxon IDs (for the bacterial Yersinia: `[1, 131567, 2, 1224, 1236, 91347, 1903411, 629]` and for the mantid one: `[1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 88770, 6656, 197563, 197562, 6960, 50557, 85512, 7496, 33340, 33341, 6970, 7504, 7505, 267071, 444888]`).
In this example, there are 2 matches for the bacterial one (2 and 1236) and 0 for the mantid one. So EsMeCaTa will select the taxon ID associated with the bacterium (629).
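The comparison described above can be sketched in plain Python. This is only an illustration of the matching rule, not the code of `disambiguate_taxon`; the helper name `pick_taxon_id` is hypothetical and the lineages are copied from the text.

```python
# Lineages copied from the text above (NCBI taxon IDs from the root to the taxon).
candidate_lineages = {
    629: [1, 131567, 2, 1224, 1236, 91347, 1903411, 629],
    444888: [1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 88770,
             6656, 197563, 197562, 6960, 50557, 85512, 7496, 33340, 33341,
             6970, 7504, 7505, 267071, 444888],
}
# Taxon IDs of the other (unambiguous) taxa in the affiliation: Bacteria and Gammaproteobacteria.
other_taxon_ids = {2, 1236}


def pick_taxon_id(candidate_lineages, other_taxon_ids):
    """Hypothetical helper: keep the candidate whose lineage shares the most IDs with the affiliation."""
    matches = {tax_id: len(other_taxon_ids & set(lineage))
               for tax_id, lineage in candidate_lineages.items()}
    best_tax_id = max(matches, key=matches.get)
    return best_tax_id, matches


selected_tax_id, matches = pick_taxon_id(candidate_lineages, other_taxon_ids)
print(matches)          # number of shared taxon IDs per candidate
print(selected_tax_id)  # candidate with the most matches
```

This sketch does not handle ties between candidates; how EsMeCaTa treats such a case is not covered here.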


In [3]:
from esmecata.proteomes import disambiguate_taxon, taxonomic_affiliation_to_taxon_id

yersinia_taxonomic_affiliation = 'Bacteria;Gammaproteobacteria;Yersinia'

yersinia_tax_ids_to_names, yersinia_taxon_ids = taxonomic_affiliation_to_taxon_id('Yersinia', yersinia_taxonomic_affiliation, ncbi)

yersinia_json_taxonomic_affiliations = {'Yersinia': yersinia_taxon_ids}
print('Before Disambiguation:\n', yersinia_json_taxonomic_affiliations)
yersinia_json_taxonomic_affiliations = disambiguate_taxon(yersinia_json_taxonomic_affiliations, ncbi)
print('After Disambiguation:\n', yersinia_json_taxonomic_affiliations)

Before Disambiguation:
 {'Yersinia': OrderedDict([('Bacteria', [2]), ('Gammaproteobacteria', [1236]), ('Yersinia', [629, 444888])])}
After Disambiguation:
 {'Yersinia': OrderedDict([('Bacteria', [2]), ('Gammaproteobacteria', [1236]), ('Yersinia', [629])])}


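As you can see, Yersinia is now associated with only one taxon ID.

The same step is applied to the taxonomic affiliation of Buchnera: of the two taxon IDs found for the name `Buchnera` (32199 and 46073), only 32199 remains after disambiguation.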

In [4]:
from esmecata.proteomes import disambiguate_taxon

json_taxonomic_affiliations = {'Buchnera': taxon_ids}
json_taxonomic_affiliations = disambiguate_taxon(json_taxonomic_affiliations, ncbi)
json_taxonomic_affiliations

{'Buchnera': OrderedDict([('cellular organisms', [131567]),
              ('Bacteria', [2]),
              ('Proteobacteria', [1224]),
              ('Gammaproteobacteria', [1236]),
              ('Enterobacterales', [91347]),
              ('Erwiniaceae', [1903409]),
              ('Buchnera', [32199]),
              ('Buchnera aphidicola', [9])])}

### Limit taxonomic rank used
Then, if a `rank_limit` has been given, EsMeCaTa will shrink the taxonomic affiliations to keep only the taxa at or below the specified taxonomic rank. By default, it is not applied, so there will be no change.

As an example, we will try a `rank_limit` of `family`.

In [7]:
from esmecata.proteomes import filter_rank_limit

rank_limit = 'family'

json_taxonomic_affiliations_limited = filter_rank_limit(json_taxonomic_affiliations, ncbi, rank_limit)

json_taxonomic_affiliations_limited

{'Buchnera': OrderedDict([('Erwiniaceae', [1903409]),
              ('Buchnera', [32199]),
              ('Buchnera aphidicola', [9])])}

As you can see, all the ranks above the family `Erwiniaceae` (all the ranks superior to `family`) have been removed from the taxonomic affiliations.

This option is useful to avoid using high taxonomic ranks (containing too many organisms to make a valid estimation of the metabolism). For example, if EsMeCaTa finds only taxon IDs at the `phylum` rank, using this option with a rank_limit of `order` will remove this taxonomic affiliation from the dataset, as it contains no taxon at or below the taxonomic rank `order`.
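As a rough sketch of this filtering, the following snippet keeps only the taxa at or below a rank limit. The rank ordering `RANK_ORDER`, the hard-coded `taxon_ranks` mapping and the helper `keep_at_or_below` are assumptions made for the illustration (in the real workflow the ranks would come from the NCBI Taxonomy database), and taxa without a standard rank are simply dropped here.

```python
from collections import OrderedDict

# Assumed, simplified ordering of ranks from highest to lowest.
RANK_ORDER = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

# Ranks hard-coded for the illustration (assumption, not queried from NCBI).
taxon_ranks = {131567: 'no rank', 2: 'superkingdom', 1224: 'phylum', 1236: 'class',
               91347: 'order', 1903409: 'family', 32199: 'genus', 9: 'species'}

affiliation = OrderedDict([('cellular organisms', [131567]), ('Bacteria', [2]),
                           ('Proteobacteria', [1224]), ('Gammaproteobacteria', [1236]),
                           ('Enterobacterales', [91347]), ('Erwiniaceae', [1903409]),
                           ('Buchnera', [32199]), ('Buchnera aphidicola', [9])])


def keep_at_or_below(affiliation, taxon_ranks, rank_limit):
    """Hypothetical helper: keep taxa whose rank is at or below rank_limit."""
    limit_index = RANK_ORDER.index(rank_limit)
    kept = OrderedDict()
    for tax_name, tax_ids in affiliation.items():
        rank = taxon_ranks.get(tax_ids[0])
        if rank in RANK_ORDER and RANK_ORDER.index(rank) >= limit_index:
            kept[tax_name] = tax_ids
    return kept


limited = keep_at_or_below(affiliation, taxon_ranks, 'family')
print(list(limited))

# An affiliation known only down to the phylum is emptied by a limit of 'order'.
phylum_only = OrderedDict([('Bacteria', [2]), ('Proteobacteria', [1224])])
print(list(keep_at_or_below(phylum_only, taxon_ranks, 'order')))
```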

### Search for proteomes in UniProt
Then we use the taxon IDs and the taxonomic affiliations to search for proteomes in UniProt.

EsMeCaTa searches for proteomes with a [BUSCO score](https://www.uniprot.org/help/assessing_proteomes) higher than or equal to 80%. Furthermore, proteomes described by UniProt as [redundant](https://www.uniprot.org/help/proteome_redundancy_faq) or [excluded](https://www.uniprot.org/help/proteome_exclusion_reasons) will not be selected.

EsMeCaTa will first search for [reference](https://www.uniprot.org/help/reference_proteome) proteomes. If no reference proteome is available, it will then search among non-reference proteomes.
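The selection rules above can be summarised with a small sketch. The records below are invented for the illustration (their IDs, scores and field names are hypothetical and do not mirror the UniProt response), and `select_proteomes` is not an EsMeCaTa function.

```python
# Hypothetical proteome records, invented for this illustration only.
records = [
    {'id': 'proteome_A', 'busco': 95.0, 'reference': False, 'redundant': False, 'excluded': False},
    {'id': 'proteome_B', 'busco': 75.0, 'reference': True, 'redundant': False, 'excluded': False},
    {'id': 'proteome_C', 'busco': 99.0, 'reference': True, 'redundant': False, 'excluded': False},
    {'id': 'proteome_D', 'busco': 90.0, 'reference': False, 'redundant': True, 'excluded': False},
]


def select_proteomes(records, busco_percentage_keep=80):
    """Hypothetical helper: apply the BUSCO, redundancy and reference rules."""
    # Drop proteomes below the BUSCO threshold and those flagged redundant or excluded.
    usable = [r for r in records
              if r['busco'] >= busco_percentage_keep
              and not r['redundant'] and not r['excluded']]
    # Prefer reference proteomes; fall back to the others only if there are none.
    reference = [r['id'] for r in usable if r['reference']]
    if reference:
        return reference
    return [r['id'] for r in usable]


print(select_proteomes(records))
# Without any usable reference proteome, the non-reference ones are returned.
print(select_proteomes([r for r in records if not r['reference']]))
```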

In [6]:
from esmecata.proteomes import rest_query_proteomes

busco_percentage_keep = 80
all_proteomes = None

for observation_name in json_taxonomic_affiliations:
    proteomes_descriptions = []
    for tax_name in reversed(json_taxonomic_affiliations[observation_name]):
        tax_id = json_taxonomic_affiliations[observation_name][tax_name][0]
        proteomes, organism_ids, data_proteomes = rest_query_proteomes(observation_name, tax_id, tax_name, busco_percentage_keep, all_proteomes)
        print(tax_name, tax_id, proteomes)

Buchnera aphidicola 9 ['UP000000601', 'UP000001806']
Buchnera 32199 ['UP000000601', 'UP000001806']
Erwiniaceae 1903409 ['UP000000601', 'UP000001806', 'UP000294462', 'UP000296144', 'UP000001702', 'UP000001726', 'UP000008793', 'UP000016900', 'UP000019918', 'UP000028602', 'UP000029577', 'UP000033924', 'UP000037088', 'UP000050856', 'UP000059419', 'UP000192900', 'UP000229786', 'UP000242222', 'UP000244334', 'UP000273595', 'UP000288794', 'UP000424752', 'UP000429602', 'UP000435266', 'UP000441083', 'UP000443529', 'UP000505325']
Enterobacterales 91347 ['UP000000625', 'UP000000601', 'UP000001806', 'UP000007794', 'UP000031627', 'UP000033094', 'UP000033104', 'UP000061704', 'UP000069926', 'UP000294462', 'UP000296144', 'UP000000260', 'UP000000558', 'UP000000746', 'UP000000747', 'UP000000815', 'UP000001006', 'UP000001014', 'UP000001030', 'UP000001122', 'UP000001123', 'UP000001410', 'UP000001702', 'UP000001726', 'UP000001889', 'UP000001955', 'UP000002032', 'UP000002084', 'UP000002514', 'UP000002529', '

Proteobacteria 1224 ['UP000000625', 'UP000000579', 'UP000000601', 'UP000001806', 'UP000007794', 'UP000031627', 'UP000033094', 'UP000033104', 'UP000061704', 'UP000069926', 'UP000294462', 'UP000296144', 'UP000000231', 'UP000000233', 'UP000000238', 'UP000000239', 'UP000000245', 'UP000000247', 'UP000000248', 'UP000000260', 'UP000000270', 'UP000000321', 'UP000000329', 'UP000000361', 'UP000000366', 'UP000000374', 'UP000000383', 'UP000000422', 'UP000000425', 'UP000000429', 'UP000000430', 'UP000000442', 'UP000000466', 'UP000000483', 'UP000000494', 'UP000000535', 'UP000000537', 'UP000000546', 'UP000000556', 'UP000000558', 'UP000000577', 'UP000000584', 'UP000000593', 'UP000000602', 'UP000000605', 'UP000000609', 'UP000000636', 'UP000000639', 'UP000000643', 'UP000000644', 'UP000000647', 'UP000000684', 'UP000000686', 'UP000000692', 'UP000000739', 'UP000000746', 'UP000000747', 'UP000000756', 'UP000000760', 'UP000000784', 'UP000000799', 'UP000000809', 'UP000000813', 'UP000000815', 'UP000000930', 'UP0

Bacteria 2 ['UP000000625', 'UP000001570', 'UP000000579', 'UP000000601', 'UP000000807', 'UP000001584', 'UP000001806', 'UP000007794', 'UP000031627', 'UP000033094', 'UP000033104', 'UP000061704', 'UP000069926', 'UP000294462', 'UP000296144', 'UP000000212', 'UP000000231', 'UP000000233', 'UP000000235', 'UP000000238', 'UP000000239', 'UP000000245', 'UP000000247', 'UP000000248', 'UP000000260', 'UP000000268', 'UP000000269', 'UP000000270', 'UP000000272', 'UP000000276', 'UP000000310', 'UP000000321', 'UP000000322', 'UP000000323', 'UP000000329', 'UP000000332', 'UP000000333', 'UP000000343', 'UP000000361', 'UP000000366', 'UP000000370', 'UP000000374', 'UP000000376', 'UP000000377', 'UP000000378', 'UP000000379', 'UP000000383', 'UP000000417', 'UP000000422', 'UP000000423', 'UP000000425', 'UP000000428', 'UP000000429', 'UP000000430', 'UP000000431', 'UP000000432', 'UP000000439', 'UP000000440', 'UP000000442', 'UP000000449', 'UP000000466', 'UP000000467', 'UP000000482', 'UP000000483', 'UP000000485', 'UP000000492'

cellular organisms 131567 ['UP000000625', 'UP000002311', 'UP000000589', 'UP000000803', 'UP000001570', 'UP000002494', 'UP000005640', 'UP000006548', 'UP000009136', 'UP000000437', 'UP000001940', 'UP000002195', 'UP000059680', 'UP000002485', 'UP000000559', 'UP000000579', 'UP000000601', 'UP000000807', 'UP000001584', 'UP000001806', 'UP000007794', 'UP000031627', 'UP000033094', 'UP000033104', 'UP000061704', 'UP000069926', 'UP000294462', 'UP000296144', 'UP000000212', 'UP000000226', 'UP000000231', 'UP000000233', 'UP000000235', 'UP000000238', 'UP000000239', 'UP000000242', 'UP000000245', 'UP000000247', 'UP000000248', 'UP000000254', 'UP000000260', 'UP000000262', 'UP000000267', 'UP000000268', 'UP000000269', 'UP000000270', 'UP000000272', 'UP000000276', 'UP000000304', 'UP000000305', 'UP000000310', 'UP000000311', 'UP000000314', 'UP000000321', 'UP000000322', 'UP000000323', 'UP000000329', 'UP000000332', 'UP000000333', 'UP000000343', 'UP000000346', 'UP000000361', 'UP000000366', 'UP000000370', 'UP000000374'

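In this example, we retrieve 2 proteomes for `Buchnera aphidicola`.
The cell above queries every rank of the unfiltered `json_taxonomic_affiliations`; with the rank_limit used in the previous section (`json_taxonomic_affiliations_limited`), the search would not go higher than the family `Erwiniaceae`.

If the number of proteomes exceeds a given limit, a subsampling will be performed.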

### Subsampling procedures

We will use a bigger example to show how the subsampling works.

In this artificial example, we have searched for the genus Escherichia and have found 20 proteomes.
Among these 20 proteomes:
- 10 are associated to *Escherichia coli* (taxon ID 562).
- 5 to *Escherichia albertii* (taxon ID 208962).
- 4 to *Escherichia fergusonii* (taxon ID 564).
- 1 to *Escherichia alba* (taxon ID 2562891).

We have put a limit of proteomes at 10 (by default it is 100).

EsMeCaTa will subsample a number of proteomes close to 10, but it will try to keep the taxonomic diversity observed.
So it will compute the percentage each organism contributes to the number of proteomes associated with Escherichia. This will give the following percentages:

- *Escherichia coli*: 10 proteomes / 20 = 50%.
- *Escherichia albertii*: 5 proteomes / 20 = 25%.
- *Escherichia fergusonii*: 4 proteomes / 20 = 20%.
- *Escherichia alba*: 1 proteome / 20 = 5%.

But there is another rule: even if an organism contributes less than 1 proteome after scaling, it must be present in the proteomes returned by EsMeCaTa. This rule and the rounding applied to the number of proteomes lead to variation in the number of proteomes returned (which can differ from the input limit_maximal_number_proteomes).
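The arithmetic behind this example can be worked through as follows. Rounding up (`math.ceil`) together with a minimum of one proteome per organism is an assumption of this sketch: it reproduces the counts reported for this example, but it is not taken from the code of `subsampling_proteomes`.

```python
import math

# Number of proteomes found per organism (taxon ID -> count), as listed above.
proteome_counts = {'562': 10, '208962': 5, '564': 4, '2562891': 1}
limit_maximal_number_proteomes = 10
total = sum(proteome_counts.values())

selected_counts = {}
for tax_id, count in proteome_counts.items():
    # Share of the limit proportional to the organism's contribution.
    share = count * limit_maximal_number_proteomes / total
    # Assumption: round up, and keep at least one proteome per organism.
    selected_counts[tax_id] = max(1, math.ceil(share))

print(selected_counts)
print(sum(selected_counts.values()))
```

The total is 11 rather than 10 because *Escherichia albertii* (2.5) and *Escherichia alba* (0.5) are both rounded up.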


In [7]:
from esmecata.proteomes import subsampling_proteomes
from collections import Counter

organism_ids = {'2562891': ['UP000477739'], '208962': ['UP000292187', 'UP000193938', 'UP000650750', 'UP000407502', 'UP000003042'],
                '562': ['UP000000625', 'UP000000558', 'UP000464341', 'UP000219757', 'UP000092491',
                        'UP000567387', 'UP000234906', 'UP000016096', 'UP000017268', 'UP000017618'],
                '564': ['UP000510927', 'UP000392711', 'UP000000745', 'UP000033773']}
revert_organism_ids ={}
for org_id, proteomes in organism_ids.items():
    revert_organism_ids.update({proteome_id: org_id for proteome_id in proteomes})
limit_maximal_number_proteomes = 10

selected_proteomes = subsampling_proteomes(organism_ids, limit_maximal_number_proteomes, ncbi)
selected_organisms = [revert_organism_ids[proteome] for proteome in selected_proteomes]

for org_id in Counter(selected_organisms):
    print(org_id, Counter(selected_organisms)[org_id])

208962 3
2562891 1
562 5
564 2


So the subsampling selects 11 proteomes among the 20:

- 5 proteomes from *Escherichia coli* (taxon ID 562).
- 3 proteomes from *Escherichia albertii* (taxon ID 208962).
- 2 proteomes from *Escherichia fergusonii* (taxon ID 564).
- 1 proteome from *Escherichia alba* (taxon ID 2562891).

### Download proteomes

After these steps, EsMeCaTa will select the lowest taxonomic rank in the taxonomic affiliations that is associated with at least 1 proteome.

In our example:

`'cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Buchnera;Buchnera aphidicola'`

The lowest taxonomic rank with a proteome is the species with `Buchnera aphidicola` and its 2 proteomes `['UP000001806', 'UP000000601']`.
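A minimal sketch of this selection, assuming the proteomes found for each taxon are stored from the highest to the lowest rank. The helper `lowest_taxon_with_proteomes` is hypothetical, and the second call uses an invented situation (no proteome for the species) to show the fallback to the genus.

```python
from collections import OrderedDict

# Proteomes found per taxon (only the two lowest ranks of the example are listed).
proteomes_by_taxon = OrderedDict([
    ('Buchnera', ['UP000000601', 'UP000001806']),
    ('Buchnera aphidicola', ['UP000000601', 'UP000001806']),
])


def lowest_taxon_with_proteomes(proteomes_by_taxon):
    """Hypothetical helper: walk from the lowest rank upwards, stop at the first taxon with proteomes."""
    for tax_name in reversed(proteomes_by_taxon):
        if len(proteomes_by_taxon[tax_name]) >= 1:
            return tax_name, proteomes_by_taxon[tax_name]
    return None, []


print(lowest_taxon_with_proteomes(proteomes_by_taxon))

# Invented variation: if the species had no proteome, the genus would be used instead.
without_species = OrderedDict(proteomes_by_taxon)
without_species['Buchnera aphidicola'] = []
print(lowest_taxon_with_proteomes(without_species))
```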

Then the selected proteomes are downloaded.

## EsMeCaTa clustering

### Protein clusters with MMseqs2

Using these proteomes, EsMeCaTa will use [MMseqs2](https://github.com/soedinglab/MMseqs2) to cluster the proteins.

First, mmseqs with the command `createdb` creates a sequence database from the proteomes.

Then it performs the clustering using `cluster` with the following default options: `--min-seq-id 0.3 -c 0.8`. These options require a minimal sequence identity of 30% and a coverage of 80%. The idea behind these values is to stay close to commonly used thresholds for homology between sequences.

From the result of the clustering `result2profile` will create profiles. Using these profiles mmseqs `profile2consensus` will create consensus sequences. These consensus are put into a fasta file with `convert2fasta`.

A fasta file containing the representative sequences is also created with `convert2fasta`.

The results of the clustering are stored into a tsv file with `createtsv`.
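To make the sequence of calls easier to follow, the sketch below only builds the command lines without running them. The file and database names are hypothetical, and the order of the positional arguments reflects my understanding of the MMseqs2 command-line usage; it should be checked against `mmseqs <command> -h` and is not a copy of what EsMeCaTa runs.

```python
def mmseqs_commands(fasta_file, workdir='mmseqs_out', min_seq_id=0.3, coverage=0.8):
    """Hypothetical helper: list the MMseqs2 calls described above (not executed here)."""
    db = f'{workdir}/db'                # sequence database
    clusters = f'{workdir}/cluster'     # clustering result
    profiles = f'{workdir}/profile'     # profiles built from the clusters
    consensus = f'{workdir}/consensus'  # consensus sequences
    tmp_dir = f'{workdir}/tmp'
    return [
        ['mmseqs', 'createdb', fasta_file, db],
        ['mmseqs', 'cluster', db, clusters, tmp_dir,
         '--min-seq-id', str(min_seq_id), '-c', str(coverage)],
        ['mmseqs', 'result2profile', db, db, clusters, profiles],
        ['mmseqs', 'profile2consensus', profiles, consensus],
        ['mmseqs', 'convert2fasta', consensus, f'{workdir}/consensus.fasta'],
        ['mmseqs', 'createtsv', db, db, clusters, f'{workdir}/cluster.tsv'],
    ]


commands = mmseqs_commands('proteomes.fasta')
for command in commands:
    print(' '.join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` on a machine where MMseqs2 is installed.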

The tsv file containing the results is as follows:

|         |         |         |
|---------|---------|---------|
|  P59503 | P59503  | P57412  |
|  Q89AJ9 | Q89AJ9  | P57386  |
|  Q89AL1 | Q89AL1  | P57369  |
|  Q89AE4 | Q89AE4  | P57473  |
|  Q89AY7 | Q89AY7  |         |

One row corresponds to one protein cluster created by mmseqs.

The first column corresponds to the name of the cluster (which is the ID of the representative protein of the cluster). The following columns contain the proteins of this cluster (the first ID is the ID of the representative protein of the cluster, which is why it occurs twice in a row).

Using this file, EsMeCaTa will create the following dictionary:

In [9]:
protein_clusters = {'P59503': ['P59503', 'P57412'],
                    'Q89AJ9': ['Q89AJ9', 'P57386'],
                   'Q89AL1': ['Q89AL1', 'P57369'],
                   'Q89AE4': ['Q89AE4', 'P57473'],
                   'Q89AY7': ['Q89AY7']}
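One possible way to build such a dictionary from the tsv content is sketched below, assuming a tab-separated layout as in the table above. The function `read_protein_clusters` is hypothetical; it also accepts a layout with one member per line, because members are accumulated per representative.

```python
import csv
import io

# Content of the tsv file shown above (tab-separated), embedded for the example.
tsv_content = (
    "P59503\tP59503\tP57412\n"
    "Q89AJ9\tQ89AJ9\tP57386\n"
    "Q89AL1\tQ89AL1\tP57369\n"
    "Q89AE4\tQ89AE4\tP57473\n"
    "Q89AY7\tQ89AY7\n"
)


def read_protein_clusters(handle):
    """Hypothetical helper: map each representative protein to the proteins of its cluster."""
    clusters = {}
    for row in csv.reader(handle, delimiter='\t'):
        if not row:
            continue
        representative = row[0]
        members = clusters.setdefault(representative, [])
        for protein in row[1:]:
            # Skip empty cells and proteins already recorded for this cluster.
            if protein and protein not in members:
                members.append(protein)
    return clusters


parsed_clusters = read_protein_clusters(io.StringIO(tsv_content))
print(parsed_clusters)
```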

### Cluster filtering according to proteome representation

These clusters will be filtered by EsMeCaTa according to how many of the proteomes associated with the taxon are represented in each cluster.

This is performed in the `esmecata.clustering.filter_protein_cluster` function.


In [10]:
organism_prots = {'P59503': 'UP000000601', 'P57412': 'UP000001806',
                'Q89AJ9': 'UP000000601', 'P57386': 'UP000001806',
                 'Q89AL1': 'UP000000601', 'P57369': 'UP000001806',
                 'Q89AE4': 'UP000000601', 'P57473': 'UP000001806',
                 'Q89AY7': 'UP000000601'}
number_proteomes = 2

computed_threshold_cluster = {}
for rep_protein in protein_clusters:
    proteomes_cluster = set([organism_prots[prot] for prot in protein_clusters[rep_protein]])
    print(rep_protein, proteomes_cluster)
    computed_threshold_cluster[rep_protein] = len(proteomes_cluster) / number_proteomes

computed_threshold_cluster

P59503 {'UP000001806', 'UP000000601'}
Q89AJ9 {'UP000001806', 'UP000000601'}
Q89AL1 {'UP000001806', 'UP000000601'}
Q89AE4 {'UP000001806', 'UP000000601'}
Q89AY7 {'UP000000601'}


{'P59503': 1.0, 'Q89AJ9': 1.0, 'Q89AL1': 1.0, 'Q89AE4': 1.0, 'Q89AY7': 0.5}

The cluster `P59503` contains 2 proteins (`P59503` from the proteome `UP000000601` and `P57412` from `UP000001806`). So it contains proteins from all the proteomes given as input to MMseqs2.

Whereas the cluster `Q89AY7` contains only 1 protein (`Q89AY7`) from 1 proteome (`UP000000601`). So it covers only 50% of the proteomes given to MMseqs2.

In [11]:
reference_threshold = 0.95
kept_clusters = []
for rep_protein in protein_clusters:
    if computed_threshold_cluster[rep_protein] >= reference_threshold:
        kept_clusters.append(rep_protein)
kept_clusters

['P59503', 'Q89AJ9', 'Q89AL1', 'Q89AE4']

Then a threshold is used to filter the clusters. By default it is 0.95, to select the protein clusters containing proteins from most of the proteomes.

So in the previous example, the cluster `Q89AY7` will be removed as it contains only a protein from one of the two proteomes (0.5 < 0.95).

If the threshold is put to 0, it will select all the clusters associated to the proteomes of the taxon.

The protein clusters kept are then written into different files and will be used during the next step.

## EsMeCaTa annotation

There are two methods to retrieve annotations.
The first one uses UniProt annotations and a method to propagate the annotations within protein clusters.
The second one uses eggnog-mapper to annotate the consensus sequences associated with the protein clusters.

### Retrieve annotation from UniProt

Using the protein cluster kept during the previous step, EsMeCaTa will query UniProt to get the annotations of the proteins inside each cluster.

It will return:

- protein name
- if the protein is reviewed or not
- list of GO Terms
- list of Enzyme Commissions
- list of InterPro domains
- list of Rhea IDs
- gene name

In [12]:
from esmecata.annotation import query_uniprot_annotation_rest

protein_queries = ['Q89AL1', 'P57369']

output_dict = {}
output_dict = query_uniprot_annotation_rest(protein_queries, output_dict)
output_dict

{'Q89AL1': ['Ribosomal large subunit pseudouridine synthase B',
  True,
  ['GO:0003723', 'GO:0120159', 'GO:0000455'],
  ['5.4.99.22'],
  ['IPR018496',
   'IPR000748',
   'IPR036986',
   'IPR042092',
   'IPR002942',
   'IPR006145',
   'IPR020103',
   'IPR020094'],
  ['RHEA:42520'],
  'rluB'],
 'P57369': ['Ribosomal large subunit pseudouridine synthase B',
  True,
  ['GO:0003723', 'GO:0120159', 'GO:0000455'],
  ['5.4.99.22'],
  ['IPR018496',
   'IPR000748',
   'IPR036986',
   'IPR042092',
   'IPR002942',
   'IPR006145',
   'IPR020103',
   'IPR020094'],
  ['RHEA:42520'],
  'rluB']}

### Propagation of annotation in the cluster

Using the annotations present in each protein of the cluster, EsMeCaTa will propagate the annotations according to the propagation `-p` option. This will impact only the GO Terms and EC. For the protein function and the gene name, EsMeCaTa will select the protein function/gene name with the most occurrences in the protein cluster.

By default, `-p` is set to 1, meaning an annotation is kept only if it is found in all the proteins of the cluster.

For example, this is the case for the previous code cell: `Q89AL1` and `P57369` both have the same GO Terms and EC, so the cluster will be associated with these annotations.

We will use another example to explain the other option. The following protein cluster contains 3 proteins with different annotation.

In [13]:
output_dict = {'prot_1': ['function_1', True, ['GO:0031522', 'GO:0004765'], ['7.4.2.8'], ['IPR027417'], [], 'gene_1'],
                'prot_2': ['function_2', True, ['GO:0031522', 'GO:0005737'], ['7.4.2.8', '2.7.1.71'], ['IPR027417'], [], 'gene_1'],
                 'prot_3': ['function_1', True, ['GO:0031522', 'GO:0005737'], ['7.4.2.8'], [], [], 'gene_3']}

By using the default propagation `-p 1` we keep annotations occurring in all proteins:

In [14]:
from esmecata.annotation import propagate_annotation_in_cluster

propagate_annotation = 1
reference_proteins = {'prot_1': ['prot_1', 'prot_2', 'prot_3']}
uniref_output_dict = None
    
protein_annotations = propagate_annotation_in_cluster(output_dict, reference_proteins, propagate_annotation, uniref_output_dict)

protein_annotations

{'prot_1': ['function_1', ['GO:0031522'], ['7.4.2.8'], 'gene_1']}

By using the propagation `-p 0.66` we keep annotations if they occur in at least 66% of the proteins of the cluster (here 2 proteins among the 3).

In this way, we add the GO Term `GO:0005737`, which is present in prot_2 and prot_3 but not in prot_1.

In [15]:
from esmecata.annotation import propagate_annotation_in_cluster

propagate_annotation = 0.66
reference_proteins = {'prot_1': ['prot_1', 'prot_2', 'prot_3']}
uniref_output_dict = None

protein_annotations = propagate_annotation_in_cluster(output_dict, reference_proteins, propagate_annotation, uniref_output_dict)

protein_annotations

{'prot_1': ['function_1', ['GO:0031522', 'GO:0005737'], ['7.4.2.8'], 'gene_1']}

Then the propagation `-p 0` will keep all the annotations present in the cluster.

This adds the GO Term `GO:0004765` (only present in prot_1) and the EC `2.7.1.71` (only present in prot_2).

In [16]:
from esmecata.annotation import propagate_annotation_in_cluster

propagate_annotation = 0
reference_proteins = {'prot_1': ['prot_1', 'prot_2', 'prot_3']}
uniref_output_dict = None
    
protein_annotations = propagate_annotation_in_cluster(output_dict, reference_proteins, propagate_annotation, uniref_output_dict)

protein_annotations

{'prot_1': ['function_1',
  ['GO:0004765', 'GO:0031522', 'GO:0005737'],
  ['2.7.1.71', '7.4.2.8'],
  'gene_1']}
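The three settings of `-p` can be reproduced with a short, self-contained sketch. It is an illustration of the threshold rule described above (an annotation is kept when the fraction of proteins carrying it is at least `-p`), not the code of `propagate_annotation_in_cluster`; the function `propagate` is hypothetical and the annotations are sorted only to get a stable order.

```python
from collections import Counter

# Same artificial cluster as above: name, reviewed, GO Terms, EC, InterPro, Rhea, gene name.
cluster_annotations = {
    'prot_1': ['function_1', True, ['GO:0031522', 'GO:0004765'], ['7.4.2.8'], ['IPR027417'], [], 'gene_1'],
    'prot_2': ['function_2', True, ['GO:0031522', 'GO:0005737'], ['7.4.2.8', '2.7.1.71'], ['IPR027417'], [], 'gene_1'],
    'prot_3': ['function_1', True, ['GO:0031522', 'GO:0005737'], ['7.4.2.8'], [], [], 'gene_3'],
}


def propagate(annotations, threshold):
    """Hypothetical helper: keep GO Terms/EC found in at least `threshold` of the proteins."""
    number_proteins = len(annotations)

    def frequent_items(index):
        # Count in how many proteins each annotation occurs.
        counts = Counter(item for annot in annotations.values() for item in set(annot[index]))
        return sorted(item for item, count in counts.items()
                      if count / number_proteins >= threshold)

    def most_common(index):
        # Protein function / gene name with the most occurrences in the cluster.
        return Counter(annot[index] for annot in annotations.values()).most_common(1)[0][0]

    return [most_common(0), frequent_items(2), frequent_items(3), most_common(6)]


for threshold in (1, 0.66, 0):
    print(threshold, propagate(cluster_annotations, threshold))
```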

Then these annotations are written into a tsv file and a PathoLogic file (for Pathway Tools).

### Retrieve annotation using eggnog-mapper

Using the protein clusters kept during the previous step, EsMeCaTa will use eggnog-mapper to annotate the consensus protein sequences associated with the protein clusters. 

It will return:

- protein name
- list of GO Terms
- list of Enzyme Commissions