### About
To get to know glycowork, I'll first get the pretrained SweetNet model working and then build an evaluator to check its performance. The evaluator is meant to be iterated on throughout the project until I can use it for the final evaluation, where I compare base SweetNet to my GLM-infused SweetNet. The test data from the pretrained model will serve as a sanity check once I set up my own SweetNet to iterate into the GLM-infused variant (eventually I'll work with glycowork/ml/models.py).

# Load Dependencies

In [3]:
import glycowork

  from .autonotebook import tqdm as notebook_tqdm


# Load pre-trained SweetNet

In [4]:
try:
    from glycowork.ml.models import prep_model
    print("Found prep_model in glycowork.ml!")
    help(prep_model)
except ImportError:
    print("Could not import prep_model directly from glycowork.ml.")
    # Optional: Explore the ml module further
    # import glycowork.ml
    # print(dir(glycowork.ml))

Found prep_model in glycowork.ml!
Help on function prep_model in module glycowork.ml.models:

prep_model(model_type: Literal['SweetNet', 'LectinOracle', 'LectinOracle_flex', 'NSequonPred'], num_classes: int, libr: Optional[Dict[str, int]] = None, trained: bool = False, hidden_dim: int = 128) -> torch.nn.modules.module.Module
    wrapper to instantiate model, initialize it, and put it on the GPU



It took a couple of hours of preamble to get that to work.

I set up a standard glycowork install in Google Colab to verify that I could get the function to work, and I almost gave up on having an editable local install. I tried one more time and succeeded in installing what I need.

With a fresh conda environment and better dependencies, I am able to use the prep_model function.

In [5]:
simple_sweetnet = prep_model("SweetNet", 3)
print(simple_sweetnet)

SweetNet(
  (conv1): GraphConv(128, 128)
  (conv2): GraphConv(128, 128)
  (conv3): GraphConv(128, 128)
  (item_embedding): Embedding(2566, 128)
  (lin1): Linear(in_features=128, out_features=1024, bias=True)
  (lin2): Linear(in_features=1024, out_features=128, bias=True)
  (lin3): Linear(in_features=128, out_features=3, bias=True)
  (bn1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): LeakyReLU(negative_slope=0.01)
  (act2): LeakyReLU(negative_slope=0.01)
)


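I have a SweetNet to play around with; now I just need a way to test it. (The help output above lists `trained: bool = False` as the default, so this particular call may have returned a freshly initialized model rather than pre-trained weights.)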

Time to build a first iteration of the evaluator

Since the pre-trained model is made for species prediction, let's do just that; once I have my own model I can predict other properties.

The evaluator function will get ported to a utils.py once I get the base system working.

Before doing evaluation I need to build a data loading and splitting system using the tools available in glycowork
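
A first iteration of the evaluator could be sketched roughly as follows. This is a minimal, dependency-free sketch and not the final design: the function name `evaluate_predictions`, the choice of accuracy plus macro-averaged F1, and the `error_marker` convention (a placeholder value for predictions that failed) are all assumptions made for illustration.

```python
from collections import Counter


def evaluate_predictions(y_true, y_pred, error_marker=-1):
    """Compute accuracy and macro-averaged F1, skipping failed predictions.

    `error_marker` is a hypothetical placeholder value marking items
    for which no prediction could be produced.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if p != error_marker]
    if not pairs:
        return {"n_evaluated": 0, "accuracy": 0.0, "macro_f1": 0.0}

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in pairs:
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    # Average F1 over the classes that occur among the true labels.
    classes = {t for t, _ in pairs}
    f1_scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]

    return {
        "n_evaluated": len(pairs),
        "accuracy": sum(tp.values()) / len(pairs),
        "macro_f1": sum(f1_scores) / len(f1_scores),
    }


# Toy labels for illustration only (not real model output).
results = evaluate_predictions([0, 1, 1, 2], [0, 1, 2, -1])
print(results)
```

Macro-F1 is included here because species labels are likely to be heavily imbalanced, in which case accuracy alone can look better than the model really is.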


In [6]:
# Let's see if I have the dependencies installed to load the species-specific dataset
try:
    from glycowork.glycan_data.loader import df_species
    print("Found df_species!")    
    #help(df_species)
except ImportError:
    print("Could not import df_species")


Found df_species!


In [7]:
# let's explore the dataset a bit

print(dir(df_species))
print(df_species.info())
print(df_species.head())
print(df_species.columns)
print(df_species["Species"].unique())

['Class', 'Domain', 'Family', 'Genus', 'Kingdom', 'Order', 'Phylum', 'Species', 'T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__arrow_c_stream__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__dataframe_consortium_standard__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '_

In [8]:
# let's transform and split the dataset using the hierarchy_filter function
from glycowork.ml.train_test_split import hierarchy_filter

# the hierarchy_filter function keeps only the specified level of the hierarchy
# and then splits the dataset into a training and a validation set
# (the call below uses "Species"; "Kingdom" would be a coarser alternative)
train_x, val_x, train_y, val_y, id_val, class_list, class_converter = hierarchy_filter(df_species, "Species")
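
To make the returned values easier to reason about, here is a toy sketch of the general idea behind such a step: encode class names as integer indices, then split into training and validation sets. It makes no claim to mirror glycowork's internals. `encode_and_split`, its parameters, and the example glycans and species are all hypothetical.

```python
import random


def encode_and_split(glycans, labels, val_fraction=0.2, seed=0):
    """Toy label encoding plus random train/validation split."""
    class_list = sorted(set(labels))
    class_converter = {name: idx for idx, name in enumerate(class_list)}
    encoded = [class_converter[name] for name in labels]

    # Shuffle indices reproducibly, then carve off the validation part.
    indices = list(range(len(glycans)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    train_x = [glycans[i] for i in train_idx]
    val_x = [glycans[i] for i in val_idx]
    train_y = [encoded[i] for i in train_idx]
    val_y = [encoded[i] for i in val_idx]
    return train_x, val_x, train_y, val_y, class_list, class_converter


# Made-up example data, for illustration only.
toy_glycans = ["glycan_a", "glycan_b", "glycan_c", "glycan_d", "glycan_e"]
toy_species = ["Homo_sapiens", "Mus_musculus", "Homo_sapiens",
               "Mus_musculus", "Homo_sapiens"]
tr_x, va_x, tr_y, va_y, toy_classes, toy_converter = encode_and_split(
    toy_glycans, toy_species
)
print(toy_converter)
```

In this sketch `class_converter` maps each class name to the integer used as the training label, and `class_list` gives the reverse lookup by position.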


In [9]:
# let's look at the results of the hierarchy_filter function
print(class_list)
print(class_converter)
print(val_x)
print(val_y)


['Abelmoschus_esculentus', 'Abelmoschus_glutinotextilis ', 'Abelmoschus_manihot', 'Abroma_augustum', 'Abrus_precatorius', 'Acanthamoeba_sp', 'Acanthocheilonema_viteae', 'Acer_pseudoplatanus', 'Acholeplasma_axanthum', 'Acidipropionibacterium_thoenii', 'Acidomonas_methanolica', 'Acinetobacter_baumannii', 'Acinetobacter_calcoaceticus', 'Acinetobacter_haemolyticus', 'Acinetobacter_lwoffii', 'Acinetobacter_radioresistens', 'Acinetobacter_sp', 'Acinonyx_jubatus', 'Acomys_russatus', 'Acremonium_sp', 'Actinidia_chinensis', 'Actinobacillus_pleuropneumoniae', 'Actinobacillus_suis', 'Actinoplanes_auranticolor', 'Actinoplanes_lobatus', 'Actinoplanes_utahensis', 'Actinopyga_mauritiana', 'Addax_nasomaculatus', 'Adeno-associated_dependoparvovirusA', 'Aedes_aegypti', 'Aeodes_ulvoidea', 'Aepyceros_melampus', 'Aeromonas_bestiarum', 'Aeromonas_caviae', 'Aeromonas_hydrophila', 'Aeromonas_salmonicida', 'Aesculus_hippocastanum', 'Agaricus_bisporus', 'Agaricus_blazei', 'Agaricus_campestris', 'Agaricus_subruf

In [10]:

# Let's convert the training and validation sets to glycan graphs
from glycowork.ml.processing import dataset_to_graphs
glycan_graphs_train = dataset_to_graphs(train_x, train_y)
glycan_graphs_val = dataset_to_graphs(val_x, val_y)

# let's take just the first ten validation graphs
glycan_test = glycan_graphs_val[0:10]

In [11]:
# let's see if I can get a prediction from the model

import torch

# --- Device Setup ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
simple_sweetnet = simple_sweetnet.to(device) # Move model to the chosen device

# --- Set Model to Evaluation Mode ---
simple_sweetnet.eval()
print("Model set to evaluation mode.")


Using device: cuda
Model set to evaluation mode.


In [12]:
torch.cuda.is_available()

True

[Data(edge_index=[2, 8], labels=[9], string_labels=[9], num_nodes=9, y=113),
 Data(edge_index=[2, 20], labels=[21], string_labels=[21], num_nodes=21, y=901),
 Data(edge_index=[2, 8], labels=[9], string_labels=[9], num_nodes=9, y=233),
 Data(edge_index=[2, 28], labels=[33], string_labels=[33], num_nodes=33, y=442),
 Data(edge_index=[2, 16], labels=[17], string_labels=[17], num_nodes=17, y=113),
 Data(edge_index=[2, 6], labels=[7], string_labels=[7], num_nodes=7, y=647),
 Data(edge_index=[2, 20], labels=[21], string_labels=[21], num_nodes=21, y=131),
 Data(edge_index=[2, 34], labels=[35], string_labels=[35], num_nodes=35, y=442),
 Data(edge_index=[2, 10], labels=[11], string_labels=[11], num_nodes=11, y=809),
 Data(edge_index=[2, 14], labels=[15], string_labels=[15], num_nodes=15, y=442)]

In [None]:
# --- Prediction loop ---

all_predictions = []
print("Starting prediction...")

# Disable gradient calculations
with torch.no_grad():

    # --- Processing items one-by-one ---
    for graph_data in glycan_test:
        if graph_data is None: continue # Skip if conversion failed

        # Move single data item to the device
        graph_data = graph_data.to(device)

        # Get model output (logits for the single item)
        # Note: Model might expect a batch, even of size 1. This depends on its forward method.
        # You might need to wrap `graph_data` in a list or use a batching mechanism.
        # Let's assume it handles single items for now, adjust if errors occur.
        try:
            output_logits = simple_sweetnet(graph_data) # Pass the prepared graph data

            # Process logits (assuming output is [1, num_classes])
            predicted_class_index = torch.argmax(output_logits, dim=1).cpu().item()
            all_predictions.append(predicted_class_index)
        except Exception as e:
            print(f"Error during prediction for one item: {e}")
            # Handle error appropriately, maybe append a placeholder like -1
            all_predictions.append(-1) # Append an error marker


# --- End of prediction loop ---

print(f"Finished prediction. Got {len(all_predictions)} predictions.")

Starting prediction...
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and 'batch'
Error during prediction for one item: SweetNet.forward() missing 2 required positional arguments: 'edge_index' and '

In [16]:
help(simple_sweetnet.forward)

Help on method forward in module glycowork.ml.models:

forward(x: torch.Tensor, edge_index: torch.Tensor, batch: torch.Tensor, inference: bool = False) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]] method of glycowork.ml.models.SweetNet instance
    Define the computation performed at every call.
    
    Should be overridden by all subclasses.
    
    .. note::
        Although the recipe for forward pass needs to be defined within
        this function, one should call the :class:`Module` instance afterwards
        instead of this since the former takes care of running the
        registered hooks while the latter silently ignores them.
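
The signature above asks for three tensors: `x` (node inputs), `edge_index` (connectivity), and `batch` (the index of the graph each node belongs to). As a minimal sketch of what the batch vector looks like, the hypothetical helper `make_batch_vector` below builds one by hand from per-graph node counts. In practice a PyTorch Geometric `DataLoader` typically provides this vector as `data.batch`.

```python
import torch


def make_batch_vector(node_counts):
    """Map every node to the index of the graph it belongs to."""
    counts = torch.as_tensor(node_counts, dtype=torch.long)
    graph_ids = torch.arange(counts.numel())
    # Repeat each graph index once per node in that graph.
    return torch.repeat_interleave(graph_ids, counts)


# Two graphs with 3 and 2 nodes respectively.
print(make_batch_vector([3, 2]))  # tensor([0, 0, 0, 1, 1])
```

For a single graph this reduces to a vector of zeros with one entry per node.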



In [25]:
# --- Prediction loop 1.2 ---
simple_sweetnet.eval()
all_predictions = []
all_true_labels_processed = [] # Store corresponding true labels

with torch.no_grad():
    # glycan_test is the list of PyG Data objects built above;
    # each item carries its true label in .y
    for i, data_item in enumerate(glycan_test):
        if data_item is None: continue

        # Move the *entire* Data object to the device first
        data_item = data_item.to(device)

        try:
            # --- Extract required arguments from the Data object ---
            # 1. Node Features (ASSUMING it's stored in .x - VERIFY THIS!)
            if not hasattr(data_item, 'x'):
                 print(f"Error: data_item at index {i} is missing '.x' attribute (node features). Skipping.")
                 continue
            node_features = data_item.x

            # 2. Edge Index
            edge_indices = data_item.edge_index

            # 3. Batch Vector (Create for single graph)
            # This tells the GNN layers which node belongs to which graph.
            # For a single graph, it's just a tensor of zeros.
            batch_vector = torch.zeros(data_item.num_nodes, dtype=torch.long, device=device)
            # --- End Extraction ---

            # --- Call model with separate, named arguments ---
            # Note: Verify the exact argument names required by model.forward using help() or inspect!
            #       It might just be model(node_features, edge_indices, batch_vector) without keywords.
            # --- Inside the prediction loop, before calling the model ---
            try:
                # Extract features (assuming data_item.x exists)
                node_features = data_item.x

                # *** Add these checks ***
                print(f"\n--- Debugging Item {i} ---")
                print(f"Node features type: {type(node_features)}")
                if node_features is not None:
                    print(f"Node features shape: {node_features.shape}")
                    print(f"Node features dtype: {node_features.dtype}")
                    print(f"Node features content (first 5): {node_features[:5]}")
                else:
                    print("Node features are None!")
                    # Optional: Raise an error here to stop immediately if None
                    # raise ValueError(f"Node features are None for item {i}")
                # *** End checks ***

                # Extract other args...
                edge_indices = data_item.edge_index
                batch_vector = torch.zeros(data_item.num_nodes, dtype=torch.long, device=device)

                # Call model
                output_logits = simple_sweetnet(x=node_features, edge_index=edge_indices, batch=batch_vector)

                # ... rest of the loop ...

            except AttributeError:
                print(f"Error: Item {i} might be missing '.x' attribute.")
                # Handle appropriately
            except Exception as e:
                print(f"Error during prediction setup or call for item {i}: {e}")
                # Handle appropriately

            output_logits = simple_sweetnet(x=node_features, edge_index=edge_indices, batch=batch_vector)

            # Process output
            predicted_class_index = torch.argmax(output_logits, dim=1).cpu().item()
            all_predictions.append(predicted_class_index)
            all_true_labels_processed.append(int(data_item.y)) # True label stored on the Data object

        except Exception as e:
            print(f"Error during prediction for item {i}: {e}")
            # Decide how to handle errors, e.g., skip or append a placeholder
            # all_predictions.append(-1) # Error marker

# --- After the loop ---
# Now compare all_predictions with all_true_labels_processed using your evaluator
# evaluator = ModelEvaluator(...)
# results = evaluator.evaluate(np.array(all_true_labels_processed), np.array(all_predictions))
# print(results)


--- Debugging Item 0 ---
Node features type: <class 'NoneType'>
Node features are None!
Error during prediction setup or call for item 0: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType
Error during prediction for item 0: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

--- Debugging Item 1 ---
Node features type: <class 'NoneType'>
Node features are None!
Error during prediction setup or call for item 1: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType
Error during prediction for item 1: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

--- Debugging Item 2 ---
Node features type: <class 'NoneType'>
Node features are None!
Error during prediction setup or call for item 2: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType
Error during prediction for item 2: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

--- Debugging Item 3 ---
Node 
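
The `Data` objects printed earlier list `edge_index`, `labels`, `string_labels`, `num_nodes`, and `y`, but no `x` field, which would explain why `data_item.x` comes back as `None`. Below is a minimal sketch of a prediction loop that reads the node inputs from `labels` instead. That `labels` holds the integer indices the embedding layer expects is an assumption based on the printed fields, not something verified against glycowork's source. `predict_graphs` and `TinyStandIn` are hypothetical names; the stand-in model only exists so the sketch runs without glycowork.

```python
from types import SimpleNamespace

import torch


def predict_graphs(model, graphs, device="cpu"):
    """Return (predictions, true_labels) for a list of single-graph objects.

    Assumes each graph exposes `labels`, `edge_index`, and `y`, and that
    model(x, edge_index, batch) returns logits of shape [1, num_classes].
    """
    model = model.to(device)
    model.eval()
    predictions, true_labels = [], []
    with torch.no_grad():
        for graph in graphs:
            x = torch.as_tensor(graph.labels, dtype=torch.long, device=device)
            edge_index = torch.as_tensor(
                graph.edge_index, dtype=torch.long, device=device
            )
            # One graph per call, so every node belongs to graph 0.
            batch = torch.zeros(x.shape[0], dtype=torch.long, device=device)
            logits = model(x, edge_index, batch)
            predictions.append(int(logits.argmax(dim=1).item()))
            true_labels.append(int(graph.y))
    return predictions, true_labels


class TinyStandIn(torch.nn.Module):
    """Hypothetical stand-in that mimics the (x, edge_index, batch) call."""

    def __init__(self, num_tokens=10, num_classes=3):
        super().__init__()
        self.embedding = torch.nn.Embedding(num_tokens, 4)
        self.head = torch.nn.Linear(4, num_classes)

    def forward(self, x, edge_index, batch):
        # Ignores edge_index and batch: mean-pools all nodes of one graph.
        return self.head(self.embedding(x).mean(dim=0, keepdim=True))


# Made-up graph for illustration only.
toy_graph = SimpleNamespace(labels=[0, 1, 2], edge_index=[[0, 1], [1, 2]], y=1)
preds, truth = predict_graphs(TinyStandIn(), [toy_graph])
print(preds, truth)
```

With the notebook's own objects the call would presumably be `predict_graphs(simple_sweetnet, glycan_test, device)`, though that is untested here. Two caveats from the outputs above: the model was created with `num_classes=3`, so its predictions can only be 0, 1, or 2 and cannot match species indices such as 113 or 901, and `prep_model` was called with the default `trained=False`.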

# I'm abandoning this branch and training my own SweetNet instead

Trying to figure out the black box of the prep_model SweetNet seems harder than just training my own model, which will also give me an evaluation metric out of the box.