Change Log
For Version 0.4.0
ml
models
added NSequonPred (for predicting whether N-linked sequons will be glycosylated) as a trained model
added LectinOracle_flex as a trained model (performs the same task as LectinOracle, with comparable performance, but accepts raw protein sequences as input instead of ESM-1b representations)
modified prep_model to allow for NSequonPred and LectinOracle_flex selection
added more model initialization options and adjusted their defaults in prep_model
model_training
changed default optimizer from AdamW to AdamW+SAM (Sharpness-Aware Minimization from https://arxiv.org/abs/2010.01412); typically increases model performance on test set by ~2%
implemented support for training models for multilabel classification
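For readers unfamiliar with Sharpness-Aware Minimization, the two-step update can be sketched as below. This is a minimal illustration and not the library's training code: it uses plain gradient descent as the base optimizer instead of AdamW, and the names sam_step, quadratic_grad, lr, and rho are hypothetical choices made for this example.

```python
import math

def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One SAM update, using plain gradient descent as the base optimizer (assumption)."""
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(weights)  # stationary point, nothing to update
    # Step 1: move to the approximate worst-case point inside an L2 ball of radius rho.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Step 2: evaluate the gradient there and apply it to the original weights.
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]

def quadratic_grad(weights):
    # Gradient of the toy loss f(w) = sum(w_i ** 2).
    return [2.0 * w for w in weights]

weights = [1.0, -2.0]
for _ in range(50):
    weights = sam_step(weights, quadratic_grad)

final_loss = sum(w * w for w in weights)
```

Each update needs two gradient evaluations, which is why SAM roughly doubles the cost of a training step compared with the base optimizer alone.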
train_test_split
added taxonomic_multilabel to prepare taxonomic glycan data for multilabel classification
inference
added get_Nsequon_preds to use NSequonPred for inference
modified get_lectin_preds to allow for LectinOracle_flex usage
motif
graph
modified subgraph_isomorphism to accept either string or precalculated graph inputs
modified subgraph_isomorphism to be able to count the number of occurring subgraphs
glycan_to_nxGraph now also records the actual monosaccharide/linkage strings as “string_labels” in the node labels
glycan_to_nxGraph and graph_to_string can now also operate on monosaccharides (glycans of length 1)
added largest_subgraph to identify the largest common subgraph between two glycans
annotate
annotate_glycan now passes a precalculated graph to subgraph_isomorphism, making motif annotation ~3x faster (this also speeds up many heatmap applications etc.)
annotate_glycan & annotate_dataset now also return the number of known/named motifs per glycan
replaced get_trisaccharides with get_k_saccharides that allows for motif recognition of user-defined size
bug fixes
tokenization
added constrain_prot and prot_to_coded to process protein sequences for LectinOracle_flex
added mask_rare_glycoletters to mask rare monosaccharides and linkages in glycan sequences
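The idea behind masking rare monosaccharides and linkages can be sketched as follows. This is an illustrative stand-in, not the signature or behavior of mask_rare_glycoletters: the function name mask_rare_tokens, the min_count threshold, the "<rare>" placeholder, and the pre-tokenized input format are all assumptions made for this example.

```python
from collections import Counter

def mask_rare_tokens(sequences, min_count=2, mask="<rare>"):
    """Replace tokens seen fewer than min_count times across all sequences."""
    counts = Counter(token for seq in sequences for token in seq)
    return [
        [token if counts[token] >= min_count else mask for token in seq]
        for seq in sequences
    ]

# Glycans already split into monosaccharide and linkage tokens (assumed format).
glycans = [
    ["Gal", "b1-4", "GlcNAc"],
    ["Gal", "b1-3", "GlcNAc"],
    ["Fuc", "a1-2", "Gal"],
]
masked = mask_rare_tokens(glycans)
```

Collapsing rare tokens into one placeholder keeps the vocabulary small, so a model does not have to learn embeddings from only a handful of examples.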
processing
check_nomenclature now returns True if no red flag is raised
glycan_data
replaced influenza_binding with the superset glycan_binding (564,647 protein-glycan interactions from 1,392 lectins)
loader
added a reindex utility function
updated linkages list
data_entry
check_presence now ensures correct glycan nomenclature
network
biosynthesis
added functions to consider post-translational glycan modifications when constructing biosynthetic networks (either via the process_ptm wrapper or as an option in construct_network)
added functionality to convert biosynthesis networks into directed graphs (either via the make_network_directed wrapper or as an option in construct_network)
added update_network to add new information to an already constructed biosynthetic network
improved construct_network to enable finding paths for all nodes that can be connected to the biosynthetic root nodes
added infuse_network to allow for visualizing glycomics abundance data together with biosynthetic networks
added choose_path to leverage biosynthetic networks from other species to determine which path is taken in diamond shapes (A->B, A->C, B->D, C->D) where both paths are virtual/not observed
various improvements to ensure that the code also works for glycan classes other than milk glycans, such as O-linked glycans
better network layouts with pydot2
added edge types (monosaccharide, monosaccharide+linkage, biosynthetic enzyme), which can be infused with differential gene expression information
bug fixes & smaller improvements (e.g., pruning of virtual leaves, exporting of networks, user choice of edge type, etc.)
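The diamond shapes mentioned for choose_path (A->B, A->C, B->D, C->D) can be found with a short search over a directed network. The sketch below is a hypothetical illustration: it stores the network as a plain dictionary of successor sets, not as the library's graph objects, and the voting rule in vote_for_intermediate is an assumed stand-in, since the criterion choose_path actually applies is not described here.

```python
from itertools import combinations

def find_diamonds(successors):
    """Return (source, left, right, sink) tuples where both branches rejoin."""
    diamonds = []
    for source, children in successors.items():
        for left, right in combinations(sorted(children), 2):
            shared = successors.get(left, set()) & successors.get(right, set())
            for sink in sorted(shared):
                diamonds.append((source, left, right, sink))
    return diamonds

def vote_for_intermediate(diamond, reference_node_sets):
    """Count in how many reference networks each intermediate is observed (assumed rule)."""
    _, left, right, _ = diamond
    votes = {left: 0, right: 0}
    for nodes in reference_node_sets:
        for candidate in (left, right):
            if candidate in nodes:
                votes[candidate] += 1
    return votes

network = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": set()}
diamonds = find_diamonds(network)

# Node sets observed in networks from other species (made-up example data).
references = [{"A", "B", "D"}, {"A", "B", "D"}, {"A", "C", "D"}]
votes = vote_for_intermediate(diamonds[0], references)
```

In this toy case the intermediate B is observed in two reference networks and C in one, so a majority rule would favor the path through B.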
evolution
added functions to calculate a distance matrix from glycan embeddings and use this to calculate dendrograms / evolutionary networks
added distance_from_metric to calculate the distance between networks, e.g., via Jaccard distance
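As a worked example of the Jaccard distance between two networks, the sketch below compares their edge sets. It is only an illustration: whether distance_from_metric compares edges, nodes, or another network property is not stated here, and the function name jaccard_distance and the example edges are hypothetical.

```python
def jaccard_distance(edges_a, edges_b):
    """1 - |intersection| / |union| of two edge collections."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0  # two empty networks are treated as identical
    return 1.0 - len(a & b) / len(union)

network_1 = [("A", "B"), ("B", "C"), ("C", "D")]
network_2 = [("A", "B"), ("B", "C"), ("B", "E")]

# Two shared edges out of four distinct edges gives a distance of 0.5.
distance = jaccard_distance(network_1, network_2)
```

Computing this value for every pair of networks yields a symmetric distance matrix, the kind of input from which dendrograms are usually built.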