### About
To get to know glycowork, I'll first get the pretrained SweetNet model working, then build an evaluator to check its performance. The evaluator is meant to be iterated on throughout the project until I can use it for the final evaluation, where I compare base SweetNet to my GLM-infused SweetNet. The test data from the pretrained model will serve as a sanity check once I set up my own SweetNet to iterate into the GLM-infused variant (eventually I'll work with glycowork/ml/models.py).

# Load Dependencies

In [1]:
import glycowork



# Load pre-trained SweetNet

In [3]:
try:
    from glycowork.ml.models import prep_model
    print("Found prep_model in glycowork.ml!")
    help(prep_model)
except ImportError:
    print("Could not import prep_model directly from glycowork.ml.")
    # Optional: Explore the ml module further
    # import glycowork.ml
    # print(dir(glycowork.ml))

Found prep_model in glycowork.ml!
Help on function prep_model in module glycowork.ml.models:

prep_model(model_type: Literal['SweetNet', 'LectinOracle', 'LectinOracle_flex', 'NSequonPred'], num_classes: int, libr: Optional[Dict[str, int]] = None, trained: bool = False, hidden_dim: int = 128) -> torch.nn.modules.module.Module
    wrapper to instantiate model, initialize it, and put it on the GPU



Getting that to work took a couple of hours of preamble.

I set up a regular glycowork install in Google Colab to verify that the function works. I almost gave up on having an editable local install, but I tried one more time and succeeded in installing what I need.

With a fresh conda environment and better dependencies, I am now able to use the prep_model function.

In [8]:
# Instantiate a SweetNet with 3 output classes (trained defaults to False)
simple_sweetnet = prep_model("SweetNet", 3)
print(simple_sweetnet)

SweetNet(
  (conv1): GraphConv(128, 128)
  (conv2): GraphConv(128, 128)
  (conv3): GraphConv(128, 128)
  (item_embedding): Embedding(2566, 128)
  (lin1): Linear(in_features=128, out_features=1024, bias=True)
  (lin2): Linear(in_features=1024, out_features=128, bias=True)
  (lin3): Linear(in_features=128, out_features=3, bias=True)
  (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): LeakyReLU(negative_slope=0.01)
  (act2): LeakyReLU(negative_slope=0.01)
)


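I now have a SweetNet instance to play around with; I just need a way to test it. Note that, per the signature printed above, prep_model defaults to trained=False, so this instance has freshly initialized weights and 3 output classes. Getting the actual pre-trained model means passing trained=True.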

Time to build a first iteration of the evaluator

Since the pre-trained model is made for species prediction, let's do just that; once I have my own model I can predict other properties.

The evaluator function will be ported to a utils.py once I get the base system working.
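
A first version of the evaluator could look roughly like the sketch below. It is framework-free on purpose, so it can be tested before any model is wired in. The function name evaluate_predictions and the choice of accuracy plus macro-averaged F1 are my own assumptions, not something glycowork provides; the inputs are plain lists of integer class labels.

```python
from collections import Counter


def evaluate_predictions(y_true, y_pred):
    """Return accuracy and macro-averaged F1 for integer class labels.

    Hypothetical helper for illustration; not part of glycowork.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot evaluate an empty label list")

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    true_counts = Counter(y_true)  # how often each class really occurs
    pred_counts = Counter(y_pred)  # how often each class was predicted
    hit_counts = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    f1_scores = []
    for cls in true_counts:  # average over the classes present in y_true
        tp = hit_counts[cls]
        precision = tp / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = tp / true_counts[cls]
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        f1_scores.append(f1)

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Toy labels, only to show the call
metrics = evaluate_predictions([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(metrics)
```

Macro-F1 is included because species labels are likely to be very unbalanced, and accuracy alone would hide poor performance on rare classes.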

Before doing evaluation, I need to build a data loading and splitting system using the tools available in glycowork.


In [21]:
# Let's see if I have the dependencies installed to load the species-specific dataset
try:
    from glycowork.glycan_data.loader import df_species
    print("Found df_species!")
    # help(df_species)
except ImportError:
    print("Could not import df_species")


Found df_species!


In [None]:
# let's explore the dataset a bit

print(dir(df_species))
print(df_species.info())
print(df_species.head())
print(df_species.columns)
print(df_species["Species"].unique())

['Class', 'Domain', 'Family', 'Genus', 'Kingdom', 'Order', 'Phylum', 'Species', 'T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__arrow_c_stream__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__dataframe_consortium_standard__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '_

In [59]:
# let's transform and split the dataset using the hierarchy_filter function
from glycowork.ml.train_test_split import hierarchy_filter

# the hierarchy_filter function keeps only the specified level of the hierarchy
# it then splits the dataset into a training and validation set
train_x, val_x, train_y, val_y, id_val, class_list, class_converter = hierarchy_filter(df_species, "Species")
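
To make sure I understand what a filter-and-split step like this does, here is a conceptual sketch in plain Python. It is not glycowork's implementation: the function name filter_and_split and the parameters min_count, val_fraction and seed are hypothetical, and the toy strings stand in for real glycan sequences. The idea is to drop rare classes, encode the remaining class names as integers, and hold out a fraction of each class for validation.

```python
import random
from collections import Counter


def filter_and_split(glycans, labels, min_count=5, val_fraction=0.2, seed=0):
    """Drop rare classes, encode labels as integers, and split per class.

    Hypothetical illustration only; not glycowork's hierarchy_filter.
    """
    counts = Counter(labels)
    kept = [(g, l) for g, l in zip(glycans, labels) if counts[l] >= min_count]

    # Sorted class names and a name -> integer index mapping
    class_list = sorted({l for _, l in kept})
    class_converter = {name: idx for idx, name in enumerate(class_list)}

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for name in class_list:
        members = [g for g, l in kept if l == name]
        rng.shuffle(members)
        n_val = max(1, int(len(members) * val_fraction))
        val += [(g, class_converter[name]) for g in members[:n_val]]
        train += [(g, class_converter[name]) for g in members[n_val:]]
    return train, val, class_list, class_converter


# Toy data: placeholder strings, not real glycan sequences
glycans = [f"glycan_{i}" for i in range(12)]
labels = ["Homo_sapiens"] * 6 + ["Mus_musculus"] * 5 + ["Rare_species"]
train, val, class_list, class_converter = filter_and_split(glycans, labels)
print(class_list, len(train), len(val))
```

Splitting inside each class keeps every remaining species represented in both sets, which matters once the evaluator reports per-class metrics.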


In [None]:
# let's look at the results of the hierarchy_filter function
print(class_list)
print(class_converter)
print(val_x)
print(val_y)


['Abelmoschus_esculentus', 'Abelmoschus_glutinotextilis ', 'Abelmoschus_manihot', 'Abroma_augustum', 'Abrus_precatorius', 'Acanthamoeba_sp', 'Acanthocheilonema_viteae', 'Acer_pseudoplatanus', 'Acholeplasma_axanthum', 'Acidipropionibacterium_thoenii', 'Acidomonas_methanolica', 'Acinetobacter_baumannii', 'Acinetobacter_calcoaceticus', 'Acinetobacter_haemolyticus', 'Acinetobacter_lwoffii', 'Acinetobacter_radioresistens', 'Acinetobacter_sp', 'Acinonyx_jubatus', 'Acomys_russatus', 'Acremonium_sp', 'Actinidia_chinensis', 'Actinobacillus_pleuropneumoniae', 'Actinobacillus_suis', 'Actinoplanes_auranticolor', 'Actinoplanes_lobatus', 'Actinoplanes_utahensis', 'Actinopyga_mauritiana', 'Addax_nasomaculatus', 'Adeno-associated_dependoparvovirusA', 'Aedes_aegypti', 'Aeodes_ulvoidea', 'Aepyceros_melampus', 'Aeromonas_bestiarum', 'Aeromonas_caviae', 'Aeromonas_hydrophila', 'Aeromonas_salmonicida', 'Aesculus_hippocastanum', 'Agaricus_bisporus', 'Agaricus_blazei', 'Agaricus_campestris', 'Agaricus_subruf
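
Later, the evaluator will need to turn predicted class indices back into species names. A minimal sketch, assuming class_converter maps each class name to its integer index (which is how I read the output above); decode_predictions and the three-entry toy_converter are made up for illustration:

```python
def decode_predictions(pred_indices, class_converter):
    """Map predicted integer indices back to class names.

    Assumes class_converter is a dict of class name -> integer index.
    """
    index_to_name = {idx: name for name, idx in class_converter.items()}
    return [index_to_name[i] for i in pred_indices]


# Toy converter using three species names from the list above
toy_converter = {
    "Abelmoschus_esculentus": 0,
    "Abrus_precatorius": 1,
    "Acer_pseudoplatanus": 2,
}
decoded = decode_predictions([2, 0, 0], toy_converter)
print(decoded)
```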

In [64]:

# Let's convert the training and validation sets to glycan graphs
from glycowork.ml.processing import dataset_to_graphs
glycan_graphs_train = dataset_to_graphs(train_x, train_y)
glycan_graphs_val = dataset_to_graphs(val_x, val_y)