# Getting started: package imports and git repo cloning

In [None]:
!git clone -b phi https://github.com/PhiCtl/isospec-internship/

In [None]:
!pip install "glycowork[ml]" nbdev

In [None]:
import warnings
warnings.filterwarnings("ignore")
from IPython.display import HTML
from nbdev.showdoc import show_doc

import numpy as np
import pandas as pd
import os
import copy
import matplotlib.pyplot as plt
import seaborn as sns

import torch

from glycowork.ml.models import *
from glycowork.ml.inference import *
from glycowork.ml.processing import *
from glycowork.ml.model_training import *
from glycowork.ml.train_test_split import *
from glycowork.glycan_data.loader import df_species, df_glycan, glycan_binding
from glycowork.glycan_data.loader import *

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import cdist

In [None]:
import sys
sys.path.insert(0,'/content/isospec-internship/notebooks')

from embed_helpers import *

In [None]:
DATA_PATH = "/content/isospec-internship/data/glycan_embedding"

# Datasets

In [None]:
# Extracted glycans used for inference and enrichment
df_glycan_list = pd.read_csv(os.path.join(DATA_PATH, 'glycan_list.csv')).rename(columns={'tissue_species': 'Species'})
df_glycan_list['type'] = 'unknown'
df_glycan_list.head()

In [None]:
# Curated list for inference
df_glycan = df_glycan[~df_glycan['glycan'].isin(df_glycan_list['glycan'].to_list())] # Remove glycans of interest from the curated list...
df_glycan.shape

In [None]:
print(f"There are {df_glycan.explode('Species')['Species'].nunique()} unique species in the data set",
         f"and {df_glycan.explode('disease_association')['disease_association'].dropna().nunique()-1} diseases.")

In [None]:
print(f"There are {glycan_binding.shape[0]} different proteins and {glycan_binding.shape[1]} associated binding glycans.")
glycan_binding.head()

In [None]:
# Used to control representation in the embedding space by assessing closeness
df_N_glycans = pd.read_csv(os.path.join(DATA_PATH, 'N_glycans_df.csv'))
df_N_glycans['type'] = 'N_glycans'

# Check that all N-glycans are present in df_glycan: rows flagged 'left_only' are missing from it
res = pd.merge(df_N_glycans, df_glycan, on='glycan', how='left', indicator =True)
res = res[res['_merge'] == 'left_only'].drop('_merge', axis=1)
print(res)
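
The check above is an anti-join: rows flagged `left_only` are N-glycans with no match in `df_glycan`, so an empty result means every N-glycan is covered. A minimal sketch on toy data (the two small frames and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for df_N_glycans and df_glycan (illustrative values only)
left = pd.DataFrame({"glycan": ["A", "B", "C"]})
right = pd.DataFrame({"glycan": ["A", "B"], "glycan_type": ["N", "N"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = left.merge(right, on="glycan", how="left", indicator=True)

# Keep only the rows of `left` that have no partner in `right`
missing = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(missing["glycan"].tolist())
```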

# 1 - Reflection on approaches

Taking inspiration from [Using graph convolutional neural networks to learn a representation for glycans](https://www.sciencedirect.com/science/article/pii/S2211124721006161#sec1):
* SweetNet is a graph convolutional neural network (GCNN) that learns a useful glycan representation from glycan sequences.
* SweetNet was trained on _the task of predicting which species a given glycan sequence came from_, i.e. a multiclass classification task, which makes model performance easier to compare. The GCNN approach outperformed existing approaches on this classification task. See more about SweetNet training [here](https://github.com/BojarLab/SweetNet/blob/main/SweetNet_code.ipynb).
* Since the model solved this task successfully, the authors inferred that the hidden glycan representation learned by the network was meaningful, and demonstrated this on various downstream tasks.

The state-of-the-art embedding for glycans from different species is already provided by SweetNet, so graph neural networks seem to be the most promising approach. Nevertheless, several embedding strategies can be considered:
* Take advantage of the curated SweetNet GCNN architecture for glycan sequence embedding, and train the model on all glycan sequences with the task of predicting the species. The `composition` and `tissue_sample` features could also be leveraged to learn the embedding space, but 1) `composition` is redundant with the glycan sequence and 2) `tissue_sample` might not exactly reflect glycan presence in tissues.
* Same setup as above, but predicting the glycan type (e.g. N, O, etc.).
* Explore another type of graph neural network architecture, namely graph transformers ([see the Hugging Face documentation](https://huggingface.co/docs/transformers/en/model_doc/graphormer#usage-tips)).



# 2 - Using SweetNet trained on a classification task

* Species classification: multiclass classification. A glycan can occur in several species, so each (glycan, species) pair is treated as a separate sample.

## 2.1 Data processing for species classification

In [None]:
df_data = df_glycan.explode('Species')
df_data = df_data.drop_duplicates(['Species', 'glycan'])
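
The two lines above turn a list-valued `Species` column into one row per (glycan, species) pair and then remove repeated pairs. A small sketch of that behaviour, using a made-up two-glycan frame:

```python
import pandas as pd

# Toy frame: 'Species' holds a list per glycan (illustrative values only)
df = pd.DataFrame({
    "glycan": ["G1", "G2"],
    "Species": [["Homo_sapiens", "Mus_musculus", "Homo_sapiens"], ["Bos_taurus"]],
})

# One row per (glycan, species) pair, then drop repeated pairs
pairs = df.explode("Species").drop_duplicates(["Species", "glycan"])
print(pairs)
```

A glycan found in several species therefore appears several times in the training data, once with each species label.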

In [None]:
train_x, val_x, train_y, val_y, id_val, class_list, class_converter = hierarchy_filter(df_data,
                                                                              rank = 'Species', min_seq=10)

## 2.1 Data processing for glycan type classification (alternative task)

In [None]:
df_data = df_glycan[df_glycan['glycan_type'].isin(['free', 'lipid', 'N', 'O', 'repeat'])]
class_list = df_data['glycan_type'].unique().tolist()
class_converter = {class_list[i] : i for i in range(len(class_list))}

In [None]:
train_x, val_x, train_y, val_y = general_split(df_data['glycan'].values.tolist(),
                                               df_data['glycan_type'].map(class_converter).values.tolist(), test_size=0.2)

In [None]:
print(f"{len(train_x)} glycans and {len(class_list)} classes")
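
For illustration, the label encoding and 80/20 hold-out can be sketched with the standard library alone. This is not how `general_split` works internally (its implementation is not shown here); the glycan names, the seed, and the shuffling strategy below are assumptions made for the example:

```python
import random

# Hypothetical glycan types and names (illustrative values only)
glycan_types = ["N", "O", "N", "lipid", "free", "O", "N", "repeat", "N", "O"]
glycans = [f"glycan_{i}" for i in range(len(glycan_types))]

# Map each class name to an integer index, in order of first appearance
class_list = list(dict.fromkeys(glycan_types))
class_converter = {name: i for i, name in enumerate(class_list)}
labels = [class_converter[t] for t in glycan_types]

# Shuffle indices with a fixed seed, then hold out 20% for validation
rng = random.Random(42)
idx = list(range(len(glycans)))
rng.shuffle(idx)
n_val = int(0.2 * len(idx))
val_idx, train_idx = idx[:n_val], idx[n_val:]

train_x = [glycans[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]
val_x = [glycans[i] for i in val_idx]
val_y = [labels[i] for i in val_idx]
print(f"{len(train_x)} training glycans, {len(val_x)} validation glycans, {len(class_list)} classes")
```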

## 2.2 Model training

### 2.2.1 Actually training the model

Model training for species classification achieved a validation accuracy similar to the one reported in the paper, i.e. 0.44.

In [None]:
dataloaders = split_data_to_train(train_x, val_x, train_y, val_y)
model = prep_model('SweetNet', trained=False, num_classes=len(class_list))
optimizer_ft, scheduler, criterion = training_setup(model, 0.0005, num_classes = len(class_list))
model_ft_3 = train_model(model, dataloaders, criterion, optimizer_ft, scheduler,
                   num_epochs = 100)

In [None]:
#torch.save(model_ft_3, '/content/SweetNet_glycan_type.pt')
torch.save(model_ft_3, '/content/SweetNet_species_10.pt')

### 2.2.2 Retrieving trained model weights

In [None]:
# If the checkpoint was saved on a different device (e.g. a GPU), pass map_location to torch.load
model_ft_3 = torch.load('/content/isospec-internship/models/SweetNet_glycan_type.pt')
model_ft_3.eval()

## 2.3 Embedding visualisation

Glycan type: for the sake of visualisation, we distinguish the glycans used to train the embedding from the N-glycans and the "discovered glycans".

In [None]:
cols = ['glycan', 'Species', 'Kingdom', 'glycan_type', 'disease_association', 'disease_species', 'tissue_sample', 'tissue_species', 'Composition']
glycans = df_data.merge(df_N_glycans[['glycan', 'type']], on='glycan', how='outer')\
                    .fillna({'type' : 'known'})\
                    [cols]

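# Regroup species into one list per glycan when 'Species' holds plain strings (after explode).
# Test the first element: the column itself is a Series, never a list.
if not isinstance(glycans['Species'].iloc[0], list):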
  glycans = glycans.groupby('glycan').agg({'Species':lambda ss : list(ss),
                                         'Kingdom' : lambda x : x,
                                         'glycan_type': lambda x : x,
                                         'disease_association': lambda x : x,
                                         'disease_species': lambda x : x,
                                         'tissue_sample': lambda x : x,
                                         'tissue_species': lambda x : x,
                                         'Composition': lambda x : x}).reset_index()

add_cols = glycans.columns.difference(df_glycan_list.columns)
for c in add_cols:
  df_glycan_list[c] = None
glycans = pd.concat([glycans, df_glycan_list]).reset_index(drop=True)
glycans['from_human'] = glycans['Species'].apply(lambda ss : 'Human glycan' if 'Homo_sapiens' in ss\
                                                 else 'Non human glycan')
df_learned_rep = glycans_to_emb(glycans['glycan'].values, model_ft_3)

In [None]:
# Get glycan main metadata
df_learned_rep_augm = df_learned_rep.merge(glycans, left_index=True, right_index=True)

### 2.3.1 Cluster visualisation

In [None]:
tsne_emb_1 = TSNE(random_state = 42).fit_transform(df_learned_rep_augm.drop([c for c in glycans.columns], axis=1))

In [None]:
tsne_emb_1 = pd.DataFrame(tsne_emb_1).merge(df_learned_rep_augm[[c for c in glycans.columns]], left_index=True, right_index=True)
tsne_emb_1.set_index('glycan', inplace=True)

In [None]:
# Set visualization_hue='glycan_type' to colour the points by glycan type instead
plot_embedding_classes(tsne_emb_1, 't-SNE', visualization_hue='from_human', save_fig=True)

* Species classification
  * The cluster plot shows that N-linked glycans tend to cluster in the top-right part of the graph, mostly within human glycan clusters. Our unknown glycans appear to fall within human glycan clusters, as expected from their tissue sample origin. Two of the unknown glycans with very similar sequences cluster very closely.
* Glycan type classification: embeddings of the same glycan type cluster tightly on the t-SNE plot. Our glycans of interest appear to belong to the N-glycan cluster.

PCA 1D clustering visualisation was not informative.

## 2.4 Enrich glycan information

The task is to find the most similar glycans to the unknown glycans of interest in the embedding space.

Several approaches can be considered:
* Finding the most relevant neighbours based on a suitable distance metric in the embedding space.
  * The simplest approach is to define a cosine distance threshold and keep all the data points below that threshold.
  * The second approach is to find an inflection point in the sorted distance curve and to treat all the neighbours before that point as belonging to the cluster of interest. We keep this second approach.
  * This stage needs validation with other methods to check whether we end up with semantically consistent clusters around the points of interest.
* Clustering the data.

I chose the cosine and Manhattan distance metrics because Euclidean distance is often less discriminative in high-dimensional spaces.
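
To make the difference between these metrics concrete, here is a small NumPy sketch. The vectors are random and purely illustrative; the dimension 128 matches the embedding size used in this notebook. Cosine distance depends only on the angle between two vectors, whereas Manhattan and Euclidean distances also depend on their magnitudes:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); insensitive to vector length
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 norm, 'cityblock' in SciPy)
    return np.abs(a - b).sum()

def euclidean_distance(a, b):
    # Square root of summed squared differences (L2 norm)
    return np.sqrt(((a - b) ** 2).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = 3.0 * a  # same direction as a, different magnitude

print("cosine   :", cosine_distance(a, b))
print("manhattan:", manhattan_distance(a, b))
print("euclidean:", euclidean_distance(a, b))
```

Two embeddings pointing in the same direction are therefore considered identical by the cosine distance even if their norms differ.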


### 2.4.1 Finding relevant neighbours

In [None]:
# Retrieve high dimensional embeddings without metadata
df_reps = df_learned_rep_augm.set_index('glycan')\
          [[i for i in range(128)]]\
          .T

In [None]:
closest_ngb_cosine = plot_neighbours_distance(df_reps, df_glycan_list, max_dist=200, save_fig=True, smoothing_sigma=3)

In [None]:
closest_ngb_manh = plot_neighbours_distance(df_reps, df_glycan_list, metric='cityblock', save_fig=True, max_dist=200, smoothing_sigma=3)
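
`plot_neighbours_distance` comes from `embed_helpers` and its internals are not shown here. The sketch below illustrates the general idea of locating inflection points on a smoothed, sorted distance curve; the function names, the Gaussian smoothing, and the edge margin are all assumptions made for this example, not the helper's actual implementation:

```python
import numpy as np

def gaussian_smooth(y, sigma):
    # Convolve with a normalised Gaussian kernel; pad with edge values to keep the length
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(y, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def inflection_points(sorted_dist, sigma=3):
    # Inflection points are where the second derivative of the smoothed curve changes sign
    radius = int(3 * sigma)
    smooth = gaussian_smooth(np.asarray(sorted_dist, dtype=float), sigma)
    second = np.gradient(np.gradient(smooth))
    sign = np.sign(second)
    crossings = np.where(sign[:-1] * sign[1:] < 0)[0] + 1
    # Ignore crossings near the borders, where padding distorts the curve
    margin = radius + 3
    keep = (crossings >= margin) & (crossings < len(smooth) - margin)
    return crossings[keep]

# Synthetic sorted distance curve with a single inflection point in the middle
t = np.linspace(0.0, 1.0, 200)
curve = (t - 0.5) ** 3 + 0.125
print(inflection_points(curve, sigma=3))
```

With real distance curves, noise can produce several sign changes, which is why a smoothing parameter is needed and why the choice of which inflection point to use should be checked visually.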

### 2.4.2 Find clusters

Reference: [DBSCAN clustering tutorial (DataCamp)](https://www.datacamp.com/tutorial/dbscan-clustering-algorithm)

#### TODO
- [ ] Choose epsilon and the minimum number of neighbours carefully, based on a meaningful distance.

In [None]:
# All glycan data embeddings
X = df_reps.T

In [None]:
run_dbscan(X, tsne_emb_1[tsne_emb_1['from_human'] == 'Human glycan'], eps=0.1, min_samples=10)
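
One common heuristic for choosing `eps` is to sort the distance of every point to its k-th nearest neighbour and look for an elbow in that curve, with k tied to `min_samples`. The sketch below uses NumPy only, Euclidean distance, and synthetic data; the metric used by `run_dbscan` is not shown here, so the choice of metric and the function name `k_distance_curve` are assumptions:

```python
import numpy as np

def k_distance_curve(X, k):
    # Pairwise Euclidean distances between all rows of X (n x n matrix)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the distance to the k-th nearest neighbour
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)

# Synthetic data: two tight blobs plus two isolated points (illustrative only)
rng = np.random.default_rng(0)
blob_1 = rng.normal(loc=0.0, scale=0.05, size=(30, 2))
blob_2 = rng.normal(loc=5.0, scale=0.05, size=(30, 2))
outliers = np.array([[10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([blob_1, blob_2, outliers])

curve = k_distance_curve(X_demo, k=5)
print(curve[-5:])  # the largest k-distances belong to the isolated points
```

A value of `eps` just below the sharp rise at the end of the curve keeps the dense points in clusters and leaves the isolated points as noise.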

### 2.4.3 Analyse neighbours

Print and save the enriched list of discovered glycans:

In [None]:
metadata = ['Species', 'glycan_type',\
                      'disease_association', 'tissue_sample',\
                      'disease_species']

In [None]:
# Change gly_idx according to df_glycan_list glycans
gly_idx = 1
df_info = glycan_information(df_glycan_list['glycan'].values[gly_idx],
                       closest_ngb_cosine,
                       glycans, metadata,
                       glycan_binding,
                       closeness_level=2, print_agg=False)

To retrieve protein binding metadata, run the cell above and explore `df_info`.

We define as neighbours all the points that are closer to the glycan of interest than the second inflection point (see the method above). We then compare the metadata gathered through embedding similarity for the glycan type and species classification models.

| Glycan    | Human-derived neighbours / all neighbours  | Neighbouring glycan types  | Associated diseases (nb of neighbours) |
|-----------|-----------|-----------|-----------|
| Fuc(a1-?)GlcNAc(b1-2)Man(a1-6)[GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc | 3 / 13 | N-glycans | female breast cancer (1) |
| Neu5Ac(a2-?)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc | 7 / 22 | N-glycans | type_2_diabetes_mellitus (1),<br>oesophageal cancer (1),<br>colorectal cancer (2) |
| Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc | 15 / 39 | N-glycans | Toxoplasma_gondii_infection (2),<br>esophageal_cancer (1),<br>lung_non_small_cell_carcinoma (1),<br>type_2_diabetes_mellitus(1) |
| Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc | 11 / 25 | N-glycans | colorectal_cancer (3),<br>type_2_diabetes_mellitus (1),<br>esophageal_cancer (1) |
| Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc | 4 / 10 | N-glycans | None |

Metadata from



| Glycan    | Human-derived neighbours / all neighbours  | Neighbouring glycan types  | Associated diseases (nb of neighbours) |
|-----------|-----------|-----------|-----------|
| Fuc(a1-?)GlcNAc(b1-2)Man(a1-6)[GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc | 4 / 10 | N-glycans | None |
| Neu5Ac(a2-?)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc | 12 / 20 | N-glycans | type_2_diabetes_mellitus (3),<br>benign_breast_tumor_tissues_vs_para_carcinoma_tissues (2),<br>filarial_elephantiasis(2),<br>lung_non_small_cell_carcinoma (2),<br>vernal_conjunctivitis (2),<br>diabetic_kidney_disease(1),<br>gastric_cancer(1) |
| Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc | 15 / 21 | N-glycans | Parkinson_disease(1),<br>cholangiocarcinoma (2),<br>influenza(2),<br>multiple_sclerosis(1),<br>pancreatic_cancer(2),<br>stomach_cancer(2),<br>thyroid_gland_papillary_carcinoma(2),<br>colorectal_cancer(3),<br>LPS_induced_inflammation(1),<br>staphyloenterotoxemia(1) |
| Neu5Ac(a2-6)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc | 17 / 26 | N-glycans | cholangiocarcinoma (2),<br>pancreatic_cancer (2),<br>stomach_cancer (2),<br>staphyloenterotoxemia (1),<br>LPS_induced_inflammation(1),<br>Parkinson_disease(1),<br>influenza (1),<br>multiple_sclerosis (1),<br>thyroid_gland_papillary_carcinoma (1),<br>cystic_fibrosis (1) |
| Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc | 7 / 14 | N-glycans | None |
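
The kind of summary shown in the tables above can be derived from a neighbour list with a few lines of pandas. The sketch below uses a made-up neighbour frame; the column names follow this notebook, but the values and the aggregation logic are illustrative assumptions, not the internals of `glycan_information`:

```python
from collections import Counter

import pandas as pd

# Hypothetical neighbours of one glycan of interest (illustrative values only)
neighbours = pd.DataFrame({
    "glycan": ["n1", "n2", "n3", "n4"],
    "Species": [["Homo_sapiens"], ["Mus_musculus"],
                ["Homo_sapiens", "Bos_taurus"], ["Gallus_gallus"]],
    "glycan_type": ["N", "N", "N", "N"],
    "disease_association": [["colorectal_cancer"], [],
                            ["colorectal_cancer", "influenza"], []],
})

# Number of neighbours that were observed in humans
n_human = neighbours["Species"].apply(lambda ss: "Homo_sapiens" in ss).sum()
summary = f"{n_human} / {len(neighbours)}"

# Number of neighbours associated with each disease
disease_counts = Counter(d for ds in neighbours["disease_association"] for d in ds)

print(summary)
print(neighbours["glycan_type"].value_counts().to_dict())
print(dict(disease_counts))
```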



