v0.4.0

Bribak released this 25 Feb 07:50 (commit 544fde0)

Change Log

For Version 0.4.0

ml

models

  • added NSequonPred (for predicting whether N-linked sequons will be glycosylated) as a trained model
  • added LectinOracle_flex as a trained model (doing the same thing as LectinOracle but able to use raw protein sequences as input rather than ESM-1b representations; with comparable performance)
  • modified prep_model to allow for NSequonPred and LectinOracle_flex selection
  • added more model initialization options and adjusted their defaults in prep_model

model_training

  • changed default optimizer from AdamW to AdamW+SAM (Sharpness-Aware Minimization, see https://arxiv.org/abs/2010.01412); this typically increases model performance on the test set by ~2%
  • implemented support for training models for multilabel classification
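
To illustrate the idea behind SAM, here is a minimal sketch of a single SAM update on a toy quadratic loss. It is not glycowork's training code: the base update here is plain gradient descent rather than AdamW, and the names `loss`, `grad`, `sam_step`, `lr`, and `rho` are hypothetical.

```python
import math

def loss(w):
    # Toy loss with its minimum at w = (3, -1); purely illustrative.
    return (w[0] - 3) ** 2 + 10 * (w[1] + 1) ** 2

def grad(w):
    # Analytical gradient of the toy loss above.
    return [2 * (w[0] - 3), 20 * (w[1] + 1)]

def sam_step(w, lr=0.05, rho=0.05):
    # 1) Gradient at the current weights.
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    # 2) Move to the (approximate) worst-case point within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 3) Gradient at the perturbed point, applied to the ORIGINAL weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [0.0, 0.0]
start_loss = loss(w)
for _ in range(200):
    w = sam_step(w)
final_loss = loss(w)
```

Each step needs two gradient evaluations, which is the main cost of SAM compared with the base optimizer alone.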

train_test_split

  • added taxonomic_multilabel to prepare taxonomic glycan data for multilabel classification
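
As a rough illustration of what preparing taxonomic data for multilabel classification involves, the sketch below collapses (glycan, taxon) pairs into one multi-hot label vector per glycan. The function name `to_multilabel` and the input layout are assumptions for this example and do not reflect the actual signature of taxonomic_multilabel.

```python
def to_multilabel(records):
    """Turn a list of (glycan, taxon) pairs into multi-hot label vectors.

    Hypothetical helper: returns (glycans, taxa, matrix) where
    matrix[i][j] == 1 if glycans[i] was observed in taxa[j].
    """
    taxa = sorted({taxon for _, taxon in records})
    column = {taxon: j for j, taxon in enumerate(taxa)}
    rows = {}
    for glycan, taxon in records:
        # One row per unique glycan; set the column of every taxon it occurs in.
        rows.setdefault(glycan, [0] * len(taxa))[column[taxon]] = 1
    glycans = list(rows)
    return glycans, taxa, [rows[g] for g in glycans]

records = [
    ("Gal(b1-4)Glc", "Homo_sapiens"),
    ("Gal(b1-4)Glc", "Bos_taurus"),
    ("Man(a1-3)Man", "Saccharomyces_cerevisiae"),
]
glycans, taxa, labels = to_multilabel(records)
```

A glycan found in several species gets several 1s in its row, which is what distinguishes multilabel from multiclass targets.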

inference

  • added get_Nsequon_preds to use NSequonPred for inference
  • modified get_lectin_preds to allow for LectinOracle_flex usage

motif

graph

  • modified subgraph_isomorphism to use both string and precalculated graph inputs
  • modified subgraph_isomorphism to be able to count the number of occurring subgraphs
  • glycan_to_nxGraph now also records the actual monosaccharide/linkage strings as “string_labels” in the node labels
  • glycan_to_nxGraph and graph_to_string can now also operate on monosaccharides (glycans of length 1)
  • added largest_subgraph to identify the largest common subgraph between two glycans
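
The counting behaviour can be pictured with a small, dependency-free backtracking matcher over labelled graphs in which monosaccharides and linkages are both nodes. This is only a sketch of the general technique, not the implementation behind subgraph_isomorphism; `count_motif_matches` and the dictionary-based graph layout are hypothetical, and edges are treated as undirected.

```python
def count_motif_matches(g_labels, g_edges, m_labels, m_edges):
    """Count label-preserving embeddings of a motif graph in a glycan graph.

    *_labels map node id -> label string; *_edges are undirected (a, b) pairs.
    Symmetric motifs are counted once per distinct node mapping.
    """
    def adjacency(labels, edges):
        adj = {n: set() for n in labels}
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
        return adj

    g_adj = adjacency(g_labels, g_edges)
    m_adj = adjacency(m_labels, m_edges)
    m_nodes = list(m_labels)

    def extend(mapping):
        if len(mapping) == len(m_nodes):
            return 1  # every motif node has been placed
        m = m_nodes[len(mapping)]
        total = 0
        for g in g_labels:
            if g in mapping.values() or g_labels[g] != m_labels[m]:
                continue
            # Already-mapped motif neighbours must be glycan neighbours too.
            if all(mapping[n] in g_adj[g] for n in m_adj[m] if n in mapping):
                mapping[m] = g
                total += extend(mapping)
                del mapping[m]
        return total

    return extend({})

# Gal(b1-4)GlcNAc(b1-3)[Gal(b1-4)GlcNAc(b1-6)]GalNAc as alternating nodes
glycan_labels = {0: "Gal", 1: "b1-4", 2: "GlcNAc", 3: "b1-3", 4: "Gal",
                 5: "b1-4", 6: "GlcNAc", 7: "b1-6", 8: "GalNAc"}
glycan_edges = [(0, 1), (1, 2), (2, 3), (3, 8), (4, 5), (5, 6), (6, 7), (7, 8)]
# Motif: Gal(b1-4)GlcNAc
motif_labels = {0: "Gal", 1: "b1-4", 2: "GlcNAc"}
motif_edges = [(0, 1), (1, 2)]

n_matches = count_motif_matches(glycan_labels, glycan_edges,
                                motif_labels, motif_edges)
```

Here the motif occurs once on each branch, so `n_matches` is 2.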

annotate

  • annotate_glycan now uses a precalculated graph when calling subgraph_isomorphism, making motif annotation ~3x faster (this also speeds up many downstream applications, such as heatmaps)
  • annotate_glycan & annotate_dataset now also return the number of known/named motifs per glycan
  • replaced get_trisaccharides with get_k_saccharides that allows for motif recognition of user-defined size
  • bug fixes
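
To show what motifs of user-defined size look like, the sketch below slides a window of k monosaccharides along an unbranched IUPAC-condensed glycan string. It is a simplification under stated assumptions: `linear_k_saccharides` is a hypothetical helper, branches in square brackets are not handled, and get_k_saccharides itself may work differently.

```python
import re

def linear_k_saccharides(glycan, k):
    """List all runs of k consecutive monosaccharides in an unbranched glycan."""
    if "[" in glycan or "]" in glycan:
        raise ValueError("this sketch only handles unbranched glycans")
    # Tokens alternate: monosaccharide, linkage, monosaccharide, ...
    tokens = re.split(r"[()]", glycan)
    width = 2 * k - 1  # k monosaccharides plus the k-1 linkages between them
    motifs = []
    for start in range(0, len(tokens) - width + 1, 2):
        window = tokens[start:start + width]
        # Re-wrap linkages (odd positions in the window) in parentheses.
        motifs.append("".join(
            f"({t})" if i % 2 else t for i, t in enumerate(window)))
    return motifs

disaccharides = linear_k_saccharides("Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-3)Gal", 2)
trisaccharides = linear_k_saccharides("Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-3)Gal", 3)
```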

tokenization

  • added constrain_prot and prot_to_coded to process protein sequences for LectinOracle_flex
  • added mask_rare_glycoletters to mask rare monosaccharides and linkages in glycan sequences
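
Masking rare glycoletters can be sketched as counting every monosaccharide and linkage across a dataset and replacing those below a frequency threshold with a generic placeholder. The helper names, the `min_count` parameter, and the placeholder strings `monosaccharide` and `?1-?` are assumptions for this example, not the confirmed behaviour of mask_rare_glycoletters.

```python
import re
from collections import Counter

# Group 1: linkage inside parentheses; group 2: monosaccharide outside them.
TOKEN = re.compile(r"\(([^()]+)\)|([^()\[\]]+)")

def count_glycoletters(glycans):
    counts = Counter()
    for glycan in glycans:
        for link, mono in TOKEN.findall(glycan):
            counts[("link", link) if link else ("mono", mono)] += 1
    return counts

def mask_rare(glycans, min_count=2,
              mono_mask="monosaccharide", link_mask="?1-?"):
    """Replace glycoletters seen fewer than min_count times (hypothetical)."""
    counts = count_glycoletters(glycans)

    def replace(match):
        link, mono = match.group(1), match.group(2)
        if link is not None:
            keep = counts[("link", link)] >= min_count
            return f"({link if keep else link_mask})"
        return mono if counts[("mono", mono)] >= min_count else mono_mask

    # Branch brackets are not matched by TOKEN, so they pass through unchanged.
    return [TOKEN.sub(replace, glycan) for glycan in glycans]

masked = mask_rare([
    "Gal(b1-4)GlcNAc",
    "Gal(b1-4)Glc",
    "Fuc(a1-2)Gal(b1-4)GlcNAc",
])
```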

processing

  • check_nomenclature now returns True if no red flag is raised

glycan_data

  • replaced influenza_binding with the superset glycan_binding (564,647 protein-glycan interactions from 1,392 lectins)

loader

  • added a reindex utility function
  • updated linkages list

data_entry

  • check_presence now ensures correct glycan nomenclature

network

biosynthesis

  • added functions to consider post-translational glycan modifications when constructing biosynthetic networks (either via the process_ptm wrapper or as an option in construct_network)
  • added functionality to convert biosynthesis networks into directed graphs (either via the make_network_directed wrapper or as an option in construct_network)
  • added update_network to add new information to an already constructed biosynthetic network
  • improved construct_network to enable finding paths for all nodes that can be connected to the biosynthetic root nodes
  • added infuse_network to allow for visualizing glycomics abundance data together with biosynthetic networks
  • added choose_path to leverage biosynthetic networks from other species to determine which path is taken in diamond shapes (A->B, A->C, B->D, C->D) where both paths are virtual/not observed
  • various improvements to ensure that the code functionality also works for classes other than milk glycans, such as O-linked glycans
  • better network layouts with pydot2
  • added edge types (monosaccharide, monosaccharide+linkage, biosynthetic enzyme), which can be infused with differential gene expression information
  • bug fixes & smaller improvements (e.g., pruning of virtual leaves, exporting of networks, user choice of edge type, etc.)
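
The diamond shapes mentioned above (A->B, A->C, B->D, C->D) can be located with a few lines of plain Python. The sketch below finds them in a successor dictionary and then picks the intermediate that is observed in a reference species when both intermediates are virtual. `find_diamonds`, `choose_intermediate`, and the data layout are hypothetical; the real choose_path may use a different decision rule.

```python
from itertools import combinations

def find_diamonds(successors):
    """Return (A, B, C, D) tuples with A->B, A->C, B->D and C->D."""
    diamonds = []
    for a, children in successors.items():
        for b, c in combinations(sorted(children), 2):
            shared = set(successors.get(b, ())) & set(successors.get(c, ()))
            for d in sorted(shared):
                diamonds.append((a, b, c, d))
    return diamonds

def choose_intermediate(diamond, virtual, reference_observed):
    """Pick B or C if exactly one of them is observed in the reference species."""
    _, b, c, _ = diamond
    if b in virtual and c in virtual:
        hits = [n for n in (b, c) if n in reference_observed]
        if len(hits) == 1:
            return hits[0]
    return None  # nothing to decide, or the reference is ambiguous

network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
diamonds = find_diamonds(network)
chosen = choose_intermediate(diamonds[0], virtual={"B", "C"},
                             reference_observed={"A", "C", "D"})
```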

evolution

  • added functions to calculate a distance matrix from glycan embeddings and use this to calculate dendrograms / evolutionary networks
  • added distance_from_metric to calculate the distance between networks, e.g., via Jaccard distance
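
As an illustration of comparing networks by Jaccard distance, the sketch below treats each network as a set of edges and builds a pairwise distance matrix. The function names and the edge-set representation are assumptions for this example; distance_from_metric may compare other network properties.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def distance_matrix(networks):
    """networks: dict mapping a name to a collection of edges (hypothetical)."""
    names = list(networks)
    matrix = [[jaccard_distance(networks[x], networks[y]) for y in names]
              for x in names]
    return names, matrix

networks = {
    "species_1": {("A", "B"), ("B", "C")},
    "species_2": {("A", "B"), ("B", "D")},
    "species_3": {("A", "B"), ("B", "C")},
}
names, dist = distance_matrix(networks)
```

A matrix like `dist` is the kind of input that hierarchical clustering routines expect when building a dendrogram.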