# Creating the evaluation dataset

For a reproducible evaluation of the CAZyme prediction tools dbCAN, CUPP and eCAMI, an automated and reproducible method for creating a shareable evaluation dataset is necessary. Additionally, for a fair and unbiased evaluation of the prediction tools, the dataset should incorporate an equal number of known CAZymes and known non-CAZymes, so that neither population is over-represented.

This notebook lays out how the script `get_evaluation_dataset.py` was developed, how it operates and how it was implemented in the evaluation of the CAZyme prediction tools.

**Required modules and packages**
The dataset is created from FASTA files containing multiple proteins. Preferably, each FASTA file contains all protein sequences encoded within a given genomic assembly. For the evaluation, genomic assemblies were first retrieved from GenBank and then parsed by `pyrewton.genbank.get_genbank_annotations.get_genbank_annotations.py`, which wrote out the annotated protein sequences to a single FASTA file. These FASTA files were then used as input for `get_evaluation_dataset.py`.

Owing to the requirement to know which proteins are CAZymes and which are non-CAZymes, [cazy_webscraper](https://github.com/HobnobMancer/cazy_webscraper) was used to acquire a local copy of the CAZy database in SQL format. The local CAZy database was then queried to determine which proteins are CAZymes and which are non-CAZymes.


## Contents

- [Genomic assemblies](#genomic)
- [Imports](#import)
- [Parsing the input FASTA files](#input_fasta)
- [Getting CAZy classification](#cazy_classification)
- [Creating datasets](#dataset)


<a id="genomic"></a>

## Genomic assemblies

For the evaluation of the CAZyme prediction tools the following genomic assemblies were used to create the datasets:
**Assemblies used in the evaluations of the prediction tools by their developers**
- GCA_000143535.4 (Botrytis cinerea B05.10) (ascomycetes)
- GCA_003290485.1 (Malassezia restricta KCTC 27527) (basidiomycetes)
- GCA_900637095.1 (Bifidobacterium bifidum NCTC13001) (high GC Gram +)
- GCA_000092285.1 (Caulobacter segnis ATCC 21756) (a-proteobacteria)
- GCA_000015865.1 (Hungateiclostridium thermocellum ATCC 27405) (firmicutes)
- GCA_00022325.1 (Caldicellulosiruptor bescii DSM 6725) (firmicutes)
- GCA_000146045.2 (Saccharomyces cerevisiae S288c) (baker's yeast)
- GCA_000975645.3 (Saccharomyces cerevisiae YJM1078) (baker's yeast)
- GCA_000007145.1 (Xanthomonas campestris pv campestris str ATCC 33913) (g-proteobacteria)
**Additional genomes to cover more fungi**

<a id="import"></a>

## Imports

In [8]:
%pip install -U "scikit-learn==0.23.1"

import re

import pandas as pd

from Bio import SeqIO
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from tqdm.notebook import tqdm

<a id="input_fasta"></a>

## Parsing the input FASTA files

The input FASTA files contain the complete pool of query protein sequences from which the dataset is selected. They need to be parsed to make it easier to determine which sequences are CAZymes and which are non-CAZymes, and then to select a subset of proteins.

The input FASTA files are parsed, generating a dataframe that stores all the data from the FASTA file. The dataframe contains three columns:
- protein data (all data from the line starting '>')
- accession (GenBank or UniProt accession of the protein)
- sequence (amino acid sequence of the protein, stored as a single string with no new line characters)
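
To make the target structure concrete, the following is a minimal, standard-library-only sketch of reading FASTA records into a dictionary whose sequences are single strings with no new line characters. It is a stand-in for illustration, not the parser used by `get_evaluation_dataset.py`; the function name `read_fasta` and the example records are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new record starts, so emit the previous one (if any)
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # sequence lines are collected and joined into one string
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# made-up records, for illustration only
example_fasta = io.StringIO(
    ">PROT00001.1 hypothetical protein\n"
    "MSTSLA\n"
    "SSQRTP\n"
    ">PROT00002.1 hypothetical protein\n"
    "MFDEAQ\n"
)

protein_dict = {"protein_data": [], "sequence": []}
for header, sequence in read_fasta(example_fasta):
    protein_dict["protein_data"].append(header)
    protein_dict["sequence"].append(sequence)
```

Each sequence is joined across its wrapped lines, so one protein always occupies exactly one entry in the `sequence` list.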


In [4]:
# define a path to a FASTA file used for demonstrating the code
fasta_path = "genbank_proteins_txid73501_GCA_008080495_1.fasta"

# create dictionary to store data that will go in the dataframe
protein_dict = {"protein_data": [], "sequence": []}

# Open and parse the FASTA file, populating the dictionary with its data
for seq_record in SeqIO.parse(fasta_path, "fasta"):
    protein_dict["protein_data"].append(seq_record.id)
    protein_dict["sequence"].append(seq_record.seq)

protein_dict

{'protein_data': ['ATY67253.1',
  'ATY67102.1',
  'ATY67103.1',
  'ATY67107.1',
  'ATY67108.1',
  'ATY67105.1',
  'ATY67106.1',
  'ATY67255.1',
  'ATY67256.1',
  'ATY67167.1',
  'ATY67166.1',
  'ATY67168.1',
  'ATY67482.1',
  'ATY67165.1',
  'ATY67164.1',
  'ATY67481.1',
  'ATY67479.1',
  'ATY67280.1',
  'ATY67282.1',
  'ATY67284.1',
  'ATY67286.1',
  'ATY67277.1',
  'ATY67072.1',
  'ATY67278.1',
  'ATY67279.1',
  'ATY67084.1',
  'ATY67085.1',
  'ATY67337.1',
  'ATY67333.1',
  'ATY67505.1',
  'ATY67504.1',
  'ATY67331.1',
  'ATY67330.1',
  'ATY67329.1',
  'ATY67502.1',
  'ATY67359.1',
  'ATY67358.1',
  'ATY67162.1',
  'ATY67163.1',
  'ATY67160.1',
  'ATY67161.1',
  'ATY67159.1',
  'ATY67344.1',
  'ATY67149.1',
  'ATY67312.1',
  'ATY67429.1',
  'ATY67360.1',
  'ATY67433.1',
  'ATY67432.1',
  'ATY67410.1',
  'ATY67434.1',
  'ATY67412.1',
  'ATY67411.1',
  'ATY67414.1',
  'ATY67413.1',
  'ATY67431.1',
  'ATY67430.1',
  'ATY67270.1',
  'ATY67210.1',
  'ATY67271.1',
  'ATY67213.1',
  'ATY67

Now that we have a dictionary of protein data and sequences, we can create a dataframe of protein sequences.

In [6]:
# build a dataframe of protein sequences
protein_df = pd.DataFrame(protein_dict)
protein_df

| |protein_data|sequence|
|---|---|---|
|0|ATY67253.1|(M, S, T, S, L, A, S, S, Q, R, T, P, N, G, N, ...|
|1|ATY67102.1|(A, H, T, P, W, S, P, A, R, L, S, H, T, L, P, ...|
|2|ATY67103.1|(M, F, D, E, A, Q, K, S, I, S, L, A, D, Q, A, ...|
|3|ATY67107.1|(M, A, A, I, P, A, T, V, D, I, L, A, L, A, K, ...|
|4|ATY67108.1|(M, D, N, D, S, S, A, S, T, P, S, G, P, S, K, ...|
|...|...|...|
|9282|ATY61645.1|(M, S, S, P, L, F, S, V, P, I, G, P, C, G, K, ...|
|9283|ATY61646.1|(M, S, G, S, V, F, P, G, W, P, R, S, S, D, D, ...|
|9284|ATY61647.1|(M, P, T, L, R, S, V, L, T, Q, E, K, P, S, V, ...|
|9285|ATY61648.1|(M, G, K, P, A, S, F, L, H, I, P, L, N, T, A, ...|


<a id="cazy_classification"></a>

## Getting CAZy classification

To prevent over-representation of either the CAZyme or the non-CAZyme population, each FASTA file of protein sequences created by `get_evaluation_dataset.py` contains an equal number of CAZymes and non-CAZymes. Therefore, the CAZy classification of each protein as a CAZyme or non-CAZyme needs to be incorporated into the protein dataframe to enable the selection of equal numbers of CAZymes and non-CAZymes.

CAZy classified CAZymes are defined as proteins that CAZy has catalogued within its database. Known non-CAZymes are defined as proteins that are not catalogued within CAZy.

`pyrewton` can retrieve protein sequences from GenBank and from UniProt. CAZy retrieves protein sequences from GenBank. CAZy does include the UniProt accession numbers of some proteins, but because CAZy and UniProt expand their datasets independently, with only occasional synchronisations, the two databases are out of step. As a result, a protein from GenBank may appear in both CAZy and UniProt while its CAZy entry lacks the corresponding UniProt accession number. Therefore, it is preferable to retrieve the GenBank accession of each protein from the input FASTA file and query the local CAZy database with it and, if that yields nothing, to check whether a UniProt accession was provided.

If the input FASTA file, containing the protein sequences from which the final dataset is to be created, was generated using `pyrewton`, the protein data lines (marked with the '>' prefix) will contain the GenBank accession.

GenBank protein accessions have the generic format of:


[UniProt protein accessions](https://www.uniprot.org/help/accession_numbers) have the following generic formats (where '.' represents no character):

|1|2|3|4|5|6|7|8|9|10|
|---|---|---|---|---|---|---|---|---|---|
|O,P,Q|0-9|A-Z,0-9|A-Z,0-9|A-Z,0-9|0-9|.|.|.|.|
|A-N,R-Z|0-9|A-Z|A-Z,0-9|A-Z,0-9|0-9|.|.|.|.|
|A-N,R-Z|0-9|A-Z|A-Z,0-9|A-Z,0-9|0-9|A-Z|A-Z,0-9|A-Z,0-9|0-9|

Owing to the generic/standardised formats of the protein accession numbers, regular expressions can be used to search for the protein accession number within the protein data.
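
As a minimal sketch of this idea, the snippet below searches a protein data line for a GenBank accession first and falls back to a UniProt accession. The UniProt pattern follows the formats in the table above. The GenBank pattern is an assumption (three letters followed by five or seven digits, with an optional version suffix), and the function name `extract_accession` is hypothetical; neither is taken from `get_evaluation_dataset.py`.

```python
import re

# assumed GenBank protein format: 3 letters + 5 or 7 digits, optional ".version"
GENBANK_PATTERN = re.compile(r"\b[A-Z]{3}\d{5}(?:\d{2})?(?:\.\d+)?\b")

# UniProt formats from the table above (6 or 10 characters)
UNIPROT_PATTERN = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def extract_accession(protein_data):
    """Return (accession, database), or (None, None) if nothing is found."""
    # GenBank is preferred, UniProt is the fallback
    for database, pattern in (("GenBank", GENBANK_PATTERN), ("UniProt", UNIPROT_PATTERN)):
        match = pattern.search(protein_data)
        if match is not None:
            return match.group(), database
    return None, None


genbank_result = extract_accession("ATY67253.1 hypothetical protein")
uniprot_result = extract_accession("sp|P12345|AATM_RABIT")
missing_result = extract_accession("no accession here")
```

Anchoring the patterns with word boundaries and explicit character classes avoids matching arbitrary runs of characters, at the cost of rejecting accession styles that the assumed formats do not cover.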

In [None]:
# get open session to local CAZy database

# define path to local version of CAZy database
cazy_db_path = ""

# Use the declarative system
Base = declarative_base()
Session = sessionmaker()

# build database engine
try:
    engine = create_engine(f"sqlite+pysqlite:///{cazy_db_path}", echo=False)
    Base.metadata.create_all(engine)
    Session.configure(bind=engine)
except Exception as err:
    print(
        "Was unable to open local CAZy database. The causing error is presented below:\n"
        f"{err}\n"
        "Cannot proceed without a valid local CAZy database.\n"
        "Terminating program."
    )
    raise

# open a session bound to the engine
session = Session()

First, check that our method of retrieving accessions is successful.

In [9]:
# check that an accession can be retrieved for each protein in the dataframe

# regular expressions for GenBank accessions, tried in order
genbank_patterns = (r"\D{2}_\d{6}", r"\D{2}\d{6}", r"\D{3}\d{5}")
# regular expressions for UniProt accessions, following the formats in the table above
# (10 character format first, then the 6 character formats)
uniprot_patterns = (
    r"[A-NR-Z]\d[A-Z][A-Z\d]{2}\d[A-Z][A-Z\d]{2}\d",
    r"[OPQ]\d[A-Z\d]{3}\d|[A-NR-Z]\d[A-Z][A-Z\d]{2}\d",
)


def get_accession(protein_data):
    """Return (accession, database) for a protein, or (None, None) if no accession is found."""
    # GenBank accessions are preferred, UniProt accessions are the fallback
    for database, patterns in (("GenBank", genbank_patterns), ("UniProt", uniprot_patterns)):
        for pattern in patterns:
            match = re.search(pattern, protein_data)
            if match is not None:
                return match.group(), database
    return None, None


for protein_data in tqdm(protein_df["protein_data"], desc="Adding CAZy classification to df"):
    accession, database = get_accession(protein_data)

    if accession is None:
        # log inability to retrieve accessions
        print(
            f"WARNING--COULD NOT RETRIEVE PROTEIN ACCESSION NUMBER FOR:\n{protein_data}\n"
            "Protein will not be included in pool of proteins used to create evaluation dataset"
        )
        continue

    print("accession=", accession)



accession= ATY67253
accession= ATY67102
accession= ATY67103
accession= ATY67107
accession= ATY67108
accession= ATY67105
accession= ATY67106
accession= ATY67255
accession= ATY67256
accession= ATY67167
accession= ATY67166
accession= ATY67168
accession= ATY67482
accession= ATY67165
accession= ATY67164
accession= ATY67481
accession= ATY67479
accession= ATY67280
accession= ATY67282
accession= ATY67284
accession= ATY67286
accession= ATY67277
accession= ATY67072
accession= ATY67278
accession= ATY67279
accession= ATY67084
accession= ATY67085
accession= ATY67337
accession= ATY67333
accession= ATY67505
accession= ATY67504
accession= ATY67331
accession= ATY67330
accession= ATY67329
accession= ATY67502
accession= ATY67359
accession= ATY67358
accession= ATY67162
accession= ATY67163
accession= ATY67160
accession= ATY67161
accession= ATY67159
accession= ATY67344
accession= ATY67149
accession= ATY67312
accession= ATY67429
accession= ATY67360
accession= ATY67433
accession= ATY67432
accession= ATY67410


accession= ATY66554
accession= ATY66775
accession= ATY66540
accession= ATY66418
accession= ATY66419
accession= ATY66417
accession= ATY66669
accession= ATY66415
accession= ATY66416
accession= ATY67040
accession= ATY66421
accession= ATY66447
accession= ATY66446
accession= ATY66734
accession= ATY66448
accession= ATY66450
accession= ATY66449
accession= ATY66736
accession= ATY66735
accession= ATY66374
accession= ATY66373
accession= ATY66546
accession= ATY66547
accession= ATY66548
accession= ATY66549
accession= ATY66681
accession= ATY66543
accession= ATY66544
accession= ATY66545
accession= ATY66541
accession= ATY66542
accession= ATY66906
accession= ATY66905
accession= ATY66904
accession= ATY66903
accession= ATY66784
accession= ATY66902
accession= ATY66901
accession= ATY66900
accession= ATY66899
accession= ATY66898
accession= ATY66879
accession= ATY66878
accession= ATY66881
accession= ATY66880
accession= ATY66482
accession= ATY66481
accession= ATY66877
accession= ATY66876
accession= ATY66875


accession= ATY65916
accession= ATY65915
accession= ATY65922
accession= ATY65921
accession= ATY65920
accession= ATY65919
accession= ATY65914
accession= ATY65913
accession= ATY65376
accession= ATY65377
accession= ATY65378
accession= ATY65379
accession= ATY65380
accession= ATY65381
accession= ATY65599
accession= ATY65382
accession= ATY65374
accession= ATY65375
accession= ATY65940
accession= ATY65939
accession= ATY65942
accession= ATY65941
accession= ATY65944
accession= ATY65943
accession= ATY65946
accession= ATY65945
accession= ATY65948
accession= ATY65947
accession= ATY65564
accession= ATY65403
accession= ATY65400
accession= ATY65401
accession= ATY65398
accession= ATY65818
accession= ATY65396
accession= ATY65397
accession= ATY65406
accession= ATY65407
accession= ATY65725
accession= ATY65724
accession= ATY65723
accession= ATY65722
accession= ATY65721
accession= ATY65720
accession= ATY65719
accession= ATY65718
accession= ATY65717
accession= ATY66220
accession= ATY66221
accession= ATY66222


accession= ATY65410
accession= ATY65384
accession= ATY65990
accession= ATY65751
accession= ATY65383
accession= ATY65993
accession= ATY65994
accession= ATY65991
accession= ATY65992
accession= ATY65958
accession= ATY65959
accession= ATY66136
accession= ATY66135
accession= ATY66138
accession= ATY66137
accession= ATY66131
accession= ATY66130
accession= ATY66134
accession= ATY66132
accession= ATY65474
accession= ATY65473
accession= ATY66170
accession= ATY66172
accession= ATY65496
accession= ATY65681
accession= ATY65497
accession= ATY66196
accession= ATY65498
accession= ATY65499
accession= ATY66006
accession= ATY66007
accession= ATY66341
accession= ATY66340
accession= ATY65595
accession= ATY65980
accession= ATY65627
accession= ATY65626
accession= ATY66343
accession= ATY66342
accession= ATY65972
accession= ATY66339
accession= ATY65387
accession= ATY65388
accession= ATY65437
accession= ATY65386
accession= ATY65385
accession= ATY65426
accession= ATY65414
accession= ATY65415
accession= ATY66053


accession= ATY58400
accession= ATY59037
accession= ATY59036
accession= ATY58399
accession= ATY58398
accession= ATY59034
accession= ATY59033
accession= ATY58403
accession= ATY58402
accession= ATY59148
accession= ATY59146
accession= ATY59147
accession= ATY59151
accession= ATY59152
accession= ATY59149
accession= ATY59150
accession= ATY59145
accession= ATY58877
accession= ATY58335
accession= ATY58684
accession= ATY58683
accession= ATY58682
accession= ATY58334
accession= ATY58681
accession= ATY58333
accession= ATY58332
accession= ATY58331
accession= ATY58330
accession= ATY59082
accession= ATY59083
accession= ATY59084
accession= ATY59085
accession= ATY59079
accession= ATY58523
accession= ATY59080
accession= ATY59081
accession= ATY59077
accession= ATY59078
accession= ATY58658
accession= ATY58657
accession= ATY58660
accession= ATY58659
accession= ATY58662
accession= ATY58661
accession= ATY58664
accession= ATY58663
accession= ATY58666
accession= ATY58665
accession= ATY59401
accession= ATY59402


accession= ATY58616
accession= ATY58619
accession= ATY58618
accession= ATY59162
accession= ATY59163
accession= ATY59355
accession= ATY59161
accession= ATY59159
accession= ATY59157
accession= ATY59158
accession= ATY59352
accession= ATY59361
accession= ATY59362
accession= ATY58555
accession= ATY58554
accession= ATY58553
accession= ATY58552
accession= ATY58426
accession= ATY58425
accession= ATY58942
accession= ATY58556
accession= ATY58940
accession= ATY58939
accession= ATY58791
accession= ATY58792
accession= ATY58793
accession= ATY58794
accession= ATY58795
accession= ATY58796
accession= ATY58797
accession= ATY58798
accession= ATY59320
accession= ATY59321
accession= ATY58353
accession= ATY58352
accession= ATY58355
accession= ATY58354
accession= ATY58349
accession= ATY58348
accession= ATY58351
accession= ATY58350
accession= ATY58357
accession= ATY58356
accession= ATY59106
accession= ATY59107
accession= ATY59104
accession= ATY59105
accession= ATY59110
accession= ATY59111
accession= ATY59108


accession= ATY65258
accession= ATY64524
accession= ATY64523
accession= ATY64535
accession= ATY64533
accession= ATY64544
accession= ATY65083
accession= ATY64549
accession= ATY64545
accession= ATY64329
accession= ATY64328
accession= ATY63967
accession= ATY63941
accession= ATY63965
accession= ATY63966
accession= ATY63955
accession= ATY63964
accession= ATY63943
accession= ATY64170
accession= ATY63972
accession= ATY63973
accession= ATY64053
accession= ATY64052
accession= ATY64051
accession= ATY64050
accession= ATY64340
accession= ATY64049
accession= ATY64048
accession= ATY64047
accession= ATY64046
accession= ATY64045
accession= ATY64669
accession= ATY64670
accession= ATY64671
accession= ATY64672
accession= ATY64665
accession= ATY64666
accession= ATY64667
accession= ATY64668
accession= ATY64663
accession= ATY64664
accession= ATY65277
accession= ATY65278
accession= ATY65275
accession= ATY65276
accession= ATY64251
accession= ATY65274
accession= ATY65272
accession= ATY65273
accession= ATY63776


accession= ATY65270
accession= ATY64456
accession= ATY64446
accession= ATY64454
accession= ATY64444
accession= ATY64443
accession= ATY64451
accession= ATY64462
accession= ATY64461
accession= ATY63980
accession= ATY63981
accession= ATY63978
accession= ATY63979
accession= ATY63976
accession= ATY63977
accession= ATY63974
accession= ATY63975
accession= ATY63982
accession= ATY63983
accession= ATY64632
accession= ATY64631
accession= ATY64634
accession= ATY64633
accession= ATY64636
accession= ATY64635
accession= ATY64638
accession= ATY64637
accession= ATY64640
accession= ATY64639
accession= ATY63923
accession= ATY63924
accession= ATY63925
accession= ATY63926
accession= ATY63927
accession= ATY63928
accession= ATY63929
accession= ATY63930
accession= ATY63921
accession= ATY63922
accession= ATY64574
accession= ATY64573
accession= ATY64572
accession= ATY64571
accession= ATY64578
accession= ATY64577
accession= ATY64576
accession= ATY64575
accession= ATY64570
accession= ATY64569
accession= ATY65312


accession= ATY64191
accession= ATY64185
accession= ATY64186
accession= ATY64187
accession= ATY64188
accession= ATY64183
accession= ATY64184
accession= ATY65205
accession= ATY65204
accession= ATY65203
accession= ATY65202
accession= ATY65201
accession= ATY65200
accession= ATY65199
accession= ATY65198
accession= ATY65197
accession= ATY65196
accession= ATY64487
accession= ATY64488
accession= ATY64485
accession= ATY64486
accession= ATY65036
accession= ATY64660
accession= ATY64481
accession= ATY64482
accession= ATY64661
accession= ATY64662
accession= ATY63908
accession= ATY65117
accession= ATY63910
accession= ATY63909
accession= ATY65122
accession= ATY65121
accession= ATY65124
accession= ATY65123
accession= ATY63915
accession= ATY64648
accession= ATY64647
accession= ATY64646
accession= ATY64645
accession= ATY64644
accession= ATY64643
accession= ATY64642
accession= ATY64641
accession= ATY64650
accession= ATY64649
accession= ATY63988
accession= ATY63989
accession= ATY63990
accession= ATY63991


accession= ATY64985
accession= ATY64993
accession= ATY64994
accession= ATY64995
accession= ATY64996
accession= ATY64997
accession= ATY64998
accession= ATY64987
accession= ATY64999
accession= ATY64991
accession= ATY64992
accession= ATY64779
accession= ATY64778
accession= ATY64777
accession= ATY64776
accession= ATY64783
accession= ATY64782
accession= ATY64781
accession= ATY64780
accession= ATY64775
accession= ATY59879
accession= ATY60368
accession= ATY60369
accession= ATY60366
accession= ATY60367
accession= ATY60220
accession= ATY60221
accession= ATY60370
accession= ATY60371
accession= ATY60360
accession= ATY60213
accession= ATY60707
accession= ATY61166
accession= ATY59864
accession= ATY61168
accession= ATY60407
accession= ATY61162
accession= ATY61165
accession= ATY61164
accession= ATY61173
accession= ATY61172
accession= ATY61221
accession= ATY61220
accession= ATY61223
accession= ATY61222
accession= ATY61217
accession= ATY61216
accession= ATY61219
accession= ATY61218
accession= ATY61215


accession= ATY60531
accession= ATY60534
accession= ATY60533
accession= ATY60535
accession= ATY60691
accession= ATY59871
accession= ATY59872
accession= ATY59869
accession= ATY59870
accession= ATY60350
accession= ATY59868
accession= ATY60347
accession= ATY59866
accession= ATY59875
accession= ATY59876
accession= ATY60139
accession= ATY60138
accession= ATY60136
accession= ATY60134
accession= ATY60627
accession= ATY60626
accession= ATY60141
accession= ATY60140
accession= ATY60619
accession= ATY60618
accession= ATY59988
accession= ATY59990
accession= ATY59992
accession= ATY59993
accession= ATY59994
accession= ATY59995
accession= ATY59996
accession= ATY59823
accession= ATY59824
accession= ATY60587
accession= ATY60586
accession= ATY60589
accession= ATY60588
accession= ATY60583
accession= ATY60582
accession= ATY60585
accession= ATY60584
accession= ATY60581
accession= ATY60580
accession= ATY61321
accession= ATY61322
accession= ATY61319
accession= ATY61320
accession= ATY61325
accession= ATY61326


accession= ATY60198
accession= ATY60199
accession= ATY61021
accession= ATY61020
accession= ATY61023
accession= ATY61022
accession= ATY61025
accession= ATY61024
accession= ATY61027
accession= ATY61026
accession= ATY61018
accession= ATY61017
accession= ATY60283
accession= ATY60284
accession= ATY60285
accession= ATY60286
accession= ATY60279
accession= ATY60280
accession= ATY60281
accession= ATY60282
accession= ATY60288
accession= ATY60289
accession= ATY61090
accession= ATY61089
accession= ATY61088
accession= ATY61087
accession= ATY61086
accession= ATY61085
accession= ATY61084
accession= ATY61083
accession= ATY61092
accession= ATY61091
accession= ATY60452
accession= ATY60451
accession= ATY60454
accession= ATY60453
accession= ATY60448
accession= ATY60447
accession= ATY60450
accession= ATY60449
accession= ATY61208
accession= ATY60455
accession= ATY60446
accession= ATY60479
accession= ATY60475
accession= ATY60476
accession= ATY60484
accession= ATY60487
accession= ATY60480
accession= ATY60481


accession= ATY60628
accession= ATY60629
accession= ATY59839
accession= ATY59838
accession= ATY59837
accession= ATY59836
accession= ATY59843
accession= ATY59842
accession= ATY59841
accession= ATY59840
accession= ATY59845
accession= ATY59844
accession= ATY60592
accession= ATY60593
accession= ATY60594
accession= ATY60595
accession= ATY60596
accession= ATY60597
accession= ATY60598
accession= ATY60599
accession= ATY60600
accession= ATY60601
accession= ATY61348
accession= ATY60728
accession= ATY61350
accession= ATY61349
accession= ATY61338
accession= ATY61343
accession= ATY61346
accession= ATY61339
accession= ATY61353
accession= ATY61352
accession= ATY60544
accession= ATY60545
accession= ATY60543
accession= ATY62559
accession= ATY62562
accession= ATY62563
accession= ATY62560
accession= ATY62561
accession= ATY62557
accession= ATY62558
accession= ATY62882
accession= ATY63210
accession= ATY63209
accession= ATY63208
accession= ATY63207
accession= ATY63655
accession= ATY62881
accession= ATY63654


accession= ATY63368
accession= ATY63367
accession= ATY63366
accession= ATY63365
accession= ATY61399
accession= ATY61397
accession= ATY63672
accession= ATY63671
accession= ATY63370
accession= ATY63369
accession= ATY62568
accession= ATY62569
accession= ATY62566
accession= ATY62567
accession= ATY63658
accession= ATY62572
accession= ATY62570
accession= ATY62571
accession= ATY62573
accession= ATY62574
accession= ATY61373
accession= ATY61372
accession= ATY61375
accession= ATY61374
accession= ATY61369
accession= ATY61368
accession= ATY61371
accession= ATY61370
accession= ATY61367
accession= ATY61366
accession= ATY62239
accession= ATY62240
accession= ATY62241
accession= ATY62242
accession= ATY62243
accession= ATY62244
accession= ATY62245
accession= ATY62246
accession= ATY62247
accession= ATY62248
accession= ATY63493
accession= ATY63492
accession= ATY63491
accession= ATY63490
accession= ATY63497
accession= ATY63496
accession= ATY63495
accession= ATY63494
accession= ATY63499
accession= ATY63498


accession= ATY61552
accession= ATY61508
accession= ATY61511
accession= ATY61510
accession= ATY61540
accession= ATY61538
accession= ATY61541
accession= ATY61506
accession= ATY61515
accession= ATY61514
accession= ATY63542
accession= ATY63543
accession= ATY63540
accession= ATY63541
accession= ATY63546
accession= ATY63547
accession= ATY63544
accession= ATY63545
accession= ATY63538
accession= ATY63539
accession= ATY62393
accession= ATY62392
accession= ATY62395
accession= ATY62394
accession= ATY62389
accession= ATY62388
accession= ATY62391
accession= ATY62390
accession= ATY62397
accession= ATY62396
accession= ATY63439
accession= ATY63440
accession= ATY63441
accession= ATY63442
accession= ATY63435
accession= ATY63436
accession= ATY63437
accession= ATY63438
accession= ATY63433
accession= ATY63434
accession= ATY62297
accession= ATY62296
accession= ATY62295
accession= ATY62294
accession= ATY62293
accession= ATY62292
accession= ATY62291
accession= ATY62290
accession= ATY62289
accession= ATY62288


accession= ATY63267
accession= ATY63266
accession= ATY63269
accession= ATY63268
accession= ATY63271
accession= ATY63270
accession= ATY63273
accession= ATY63272
accession= ATY63265
accession= ATY63264
accession= ATY61967
accession= ATY61968
accession= ATY61969
accession= ATY61970
accession= ATY61971
accession= ATY61972
accession= ATY61973
accession= ATY61974
accession= ATY61975
accession= ATY61976
accession= ATY63174
accession= ATY63173
accession= ATY63172
accession= ATY63171
accession= ATY63178
accession= ATY63177
accession= ATY63176
accession= ATY63175
accession= ATY63180
accession= ATY63179
accession= ATY61882
accession= ATY61883
accession= ATY61880
accession= ATY61881
accession= ATY61886
accession= ATY61887
accession= ATY61884
accession= ATY61885
accession= ATY61878
accession= ATY61879
accession= ATY63099
accession= ATY63098
accession= ATY63101
accession= ATY63100
accession= ATY63095
accession= ATY63094
accession= ATY63097
accession= ATY63096
accession= ATY63103
accession= ATY63102


accession= ATY61396
accession= ATY61395
accession= ATY63488
accession= ATY63489
accession= ATY63476
accession= ATY63477
accession= ATY63474
accession= ATY63475
accession= ATY63471
accession= ATY63472
accession= ATY61354
accession= ATY61355
accession= ATY62147
accession= ATY62146
accession= ATY62099
accession= ATY63640
accession= ATY62126
accession= ATY62125
accession= ATY62123
accession= ATY62122
accession= ATY61443
accession= ATY61442
accession= ATY61851
accession= ATY61853
accession= ATY61854
accession= ATY61855
accession= ATY61840
accession= ATY62647
accession= ATY63349
accession= ATY61849
accession= ATY61866
accession= ATY61867
accession= ATY61949
accession= ATY61950
accession= ATY61947
accession= ATY61948
accession= ATY62542
accession= ATY62543
accession= ATY61945
accession= ATY61946
accession= ATY63734
accession= ATY62546
accession= ATY62514
accession= ATY63336
accession= ATY62516
accession= ATY62515
accession= ATY63379
accession= ATY62518
accession= ATY62540
accession= ATY62529


Next, query the local CAZy database to get the CAZy classification of each protein as a CAZyme or non-CAZyme.

In [None]:
# retrieve the CAZy CAZyme/non-CAZyme classification for each protein in the dataframe
# NOTE: Genbank and Uniprot are the ORM classes mapped to the tables of the local CAZy
# database; they must be defined or imported before this cell is run

# regular expressions for GenBank accessions, tried in order
genbank_patterns = (r"\D{2}_\d{6}", r"\D{2}\d{6}", r"\D{3}\d{5}")
# regular expressions for UniProt accessions, following the formats in the table above
# (10 character format first, then the 6 character formats)
uniprot_patterns = (
    r"[A-NR-Z]\d[A-Z][A-Z\d]{2}\d[A-Z][A-Z\d]{2}\d",
    r"[OPQ]\d[A-Z\d]{3}\d|[A-NR-Z]\d[A-Z][A-Z\d]{2}\d",
)


def get_accession(protein_data):
    """Return (accession, database) for a protein, or (None, None) if no accession is found."""
    # GenBank accessions are preferred, UniProt accessions are the fallback
    for database, patterns in (("GenBank", genbank_patterns), ("UniProt", uniprot_patterns)):
        for pattern in patterns:
            match = re.search(pattern, protein_data)
            if match is not None:
                return match.group(), database
    return None, None


cazyme_classifications = []  # one entry per row of protein_df

for protein_data in tqdm(protein_df["protein_data"], desc="Adding CAZy classification to df"):
    accession, database = get_accession(protein_data)

    if accession is None:
        # log inability to retrieve accessions
        print(
            f"WARNING--COULD NOT RETRIEVE PROTEIN ACCESSION NUMBER FOR:\n{protein_data}\n"
            "Protein will not be included in pool of proteins used to create evaluation dataset"
        )
        cazyme_classifications.append(None)
        continue

    # query local CAZy database to see if accession is included
    if database == "GenBank":
        query = session.query(Genbank).filter_by(genbank_accession=accession).all()
    else:
        query = session.query(Uniprot).filter_by(uniprot_accession=accession).all()

    # 1 if the protein is a CAZyme (catalogued in CAZy), 0 if it is not
    cazyme_classifications.append(1 if len(query) != 0 else 0)

# add the CAZy classification to the protein dataframe as a single new column,
# then drop the proteins for which no accession could be retrieved
protein_df["cazyme_classification"] = cazyme_classifications
protein_df = protein_df.dropna(subset=["cazyme_classification"]).astype(
    {"cazyme_classification": int}
)

protein_df
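
The membership test at the heart of this step can be tried without a copy of CAZy. The sketch below uses Python's built-in `sqlite3` module and an in-memory database in place of the SQLAlchemy session. The table name `genbanks`, the accessions, and the function name `classify` are made up for illustration and are not the schema produced by `cazy_webscraper`.

```python
import sqlite3

# in-memory stand-in for the local CAZy database (hypothetical schema)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE genbanks (genbank_accession TEXT PRIMARY KEY)")
connection.executemany(
    "INSERT INTO genbanks (genbank_accession) VALUES (?)",
    [("AAA00001.1",), ("AAA00002.1",)],  # made-up accessions
)


def classify(accession, connection):
    """Return 1 if the accession is catalogued in the database (CAZyme), else 0."""
    row = connection.execute(
        "SELECT 1 FROM genbanks WHERE genbank_accession = ? LIMIT 1",
        (accession,),
    ).fetchone()
    return 1 if row is not None else 0


accessions = ["AAA00001.1", "ZZZ99999.1", "AAA00002.1"]
classifications = [classify(accession, connection) for accession in accessions]
```

A protein is labelled a CAZyme only when its accession is present in the database, which mirrors the definition of known CAZymes and known non-CAZymes given earlier.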

<a id="dataset"></a>

## Creating datasets

To reiterate, the FASTA files of protein sequences analysed by the CAZyme prediction tools during the evaluation need to contain equal numbers of CAZymes and non-CAZymes.

It is also better to analyse an equal number of proteins per candidate species during the evaluation. Additionally, a broader subset of CAZymes and non-CAZymes from each candidate species should be included in the evaluation.

One method to achieve this is to define the number of CAZymes and non-CAZymes, and create three testing sets from the `get_evaluation_dataset.py` input FASTA file containing all proteins from a candidate species.


In [None]:
# create three datasets from one FASTA file containing all proteins from a genomic assembly

# define the number of proteins to be included in each dataset
pool_size = 200 * 2   # 200 CAZymes + 200 non-CAZymes

# define the x and y array for sklearn train_test_split()

# x_array is a 2 dimensional array of inputs (protein sequences) for evaluation
x_array = protein_df[["protein_data", "sequence"]].to_numpy()

# y_array is a 1 dimensional array of the outputs (the CAZy CAZyme/non-CAZyme classification)
y_array = protein_df["cazyme_classification"].to_numpy()

for i in range(3):
    # train_test_split() returns the train and test subsets of each array; only the test
    # subsets are used. stratify preserves the CAZyme:non-CAZyme ratio of the input pool,
    # so the test set is only split 200:200 if the input pool is itself balanced
    _, x_test, _, y_test = train_test_split(
        x_array,
        y_array,
        test_size=pool_size,
        stratify=y_array,
        random_state=i,  # a different, reproducible selection for each dataset
    )

    print("evaluation input=\n", x_test)
    print("cazyme classification of evaluation input=\n", y_test)
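
Stratified splitting preserves the class ratio of the pool rather than forcing a 1:1 ratio, so one alternative is to sample each class separately. The following is a sketch of that idea using only the standard library. The function name `balanced_sample`, the pool of accessions, and the sample sizes are hypothetical and are not part of `get_evaluation_dataset.py`.

```python
import random


def balanced_sample(proteins, n_per_class, seed):
    """Pick n_per_class CAZymes and n_per_class non-CAZymes at random.

    proteins is a list of (accession, cazyme_classification) pairs, where the
    classification is 1 for a CAZyme and 0 for a non-CAZyme.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    cazymes = [p for p in proteins if p[1] == 1]
    non_cazymes = [p for p in proteins if p[1] == 0]
    if len(cazymes) < n_per_class or len(non_cazymes) < n_per_class:
        raise ValueError("Not enough proteins in one of the classes")
    selection = rng.sample(cazymes, n_per_class) + rng.sample(non_cazymes, n_per_class)
    rng.shuffle(selection)
    return selection


# made-up pool: 10 CAZymes and 40 non-CAZymes
pool = [(f"CAZ{i:05d}.1", 1) for i in range(10)] + [(f"NON{i:05d}.1", 0) for i in range(40)]

# three test sets, each seeded differently
test_sets = [balanced_sample(pool, n_per_class=5, seed=seed) for seed in range(3)]
```

Each test set holds exactly five CAZymes and five non-CAZymes regardless of how unbalanced the pool is, and rerunning with the same seed returns the same selection.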
