Shuffling+clustering round (best params) using classification #134

Closed
6 tasks done
gcroci2 opened this issue Feb 22, 2023 · 1 comment

gcroci2 commented Feb 22, 2023

After having determined the influence of standardization (#124), the weighted loss function (#126), batch size (#127), and batch normalization (#131), experiment with clustered data, using the choices that gave the best results on shuffled data.
Common features among the following experiments:

  • Data used are in data/pMHCI/features_output_folder/GNN/residue/230329/ (generated in Regenerate hdf5 files for GNNs #140 )
    • Residue-level queries
  • Model is a slightly improved naive GNN (PMHCI_Network01 in src/4_train_models/GNN/I/classification/struct/pmhc_gnn.py)
    • Added a convolutional layer and a linear layer
  • Standardization applied to all features
  • Weighted loss function
  • Batch size 64
  • pssm feature removed
  • Cross entropy loss
  • Adam optimizer
  • Number of epochs increased (70 epochs, min_epoch 45, earlystop_patience 20, earlystop_maxgap 0.06); a sketch of this training setup follows below
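A minimal PyTorch sketch of this shared training configuration. The class weights come from the target distribution reported below (0: 56%, 1: 44%); the model stand-in and the exact semantics of earlystop_maxgap are assumptions for illustration, not the actual pipeline code:

```python
import torch
from torch import nn, optim

# Hypothetical stand-in for PMHCI_Network01 (the real model lives in
# src/4_train_models/GNN/I/classification/struct/pmhc_gnn.py).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# Weighted cross entropy: weights inversely proportional to the class
# frequencies (0: 56%, 1: 44%).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1 / 0.56, 1 / 0.44]))
optimizer = optim.Adam(model.parameters())

MAX_EPOCHS = 70          # number of epochs
MIN_EPOCH = 45           # never stop before this epoch
EARLYSTOP_PATIENCE = 20  # epochs without validation improvement tolerated
EARLYSTOP_MAXGAP = 0.06  # max tolerated train/validation loss gap (assumed meaning)

def should_stop(epoch, epochs_since_best, train_loss, val_loss):
    """Early stopping: past MIN_EPOCH, stop when patience is exhausted
    or the train/validation gap signals overfitting."""
    if epoch < MIN_EPOCH:
        return False
    return (epochs_since_best >= EARLYSTOP_PATIENCE
            or (val_loss - train_loss) > EARLYSTOP_MAXGAP)
```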

Experiments:

  • Plot the best model obtained using shuffled data, and reuse its parameters in all the experiments below
  • Clustering on peptide clusters (cluster 3 is the test set; see pMHCI data exploration - BA quantitative and = only #113 for details)
  • Clustering on alleles (cluster 1 is the test set; see pMHCI data exploration - BA quantitative and = only #113 for details)
  • Clustering on allele genes (HLA-C as test set), only to confirm it doesn't work
  • Compare against the sequence-based model and the best CNN as baselines
  • Show the improvements made since the very beginning (no standardization, no other parameter improvements), for GNNs only

gcroci2 commented Mar 17, 2023

Best configuration experiments

Comparisons

  • The GNN model and configuration are the ones used in exp_100k_pssm_rm_std_bs64_net1bn_[...]; see the comments below for more details
  • MHCFlurry refers to the state-of-the-art sequence-based model, retrained on our data following their guidelines
  • MLP refers to a simple sequence-based multilayer perceptron we created as an additional comparison
  • CNN_ and GNN_ refer to our structure-based models, which we aim to show generalize better than sequence-based ones
    • We picked the best CNN and GNN models developed so far

[Figures: performance comparison of MHCFlurry, MLP, CNN and GNN models on shuffled and clustered data]

  • CNNs and GNNs seem to perform quite similarly
  • The differences between CNNs/GNNs and sequence-based methods are more evident with clustered data, in particular with allele clusters
    • This may mean that CNNs and GNNs, i.e. structure-based models, generalize better than sequence-based models

GNNs in more detail

Legend for experiment names

100k: number of data points
std: standardization on all features is applied
classw: weighted loss function is used
bs: batch size
pssm_rm: pssm feature is removed
net1/net1bn: PMHCI_Network01 is used, either without (net1) or with (net1bn) batch normalization applied after the linear layers; a sketch of the difference follows this legend. If neither term is present, the naive GNN was used.
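As an illustration of the net1 vs net1bn difference (a hypothetical sketch, not the actual PMHCI_Network01 code):

```python
from torch import nn

def head(in_features: int, hidden: int, with_bn: bool) -> nn.Sequential:
    """Final layers of the network, with optional batch normalization."""
    layers = [nn.Linear(in_features, hidden)]
    if with_bn:  # net1bn: batch normalization after the Linear layer
        layers.append(nn.BatchNorm1d(hidden))
    layers += [nn.ReLU(), nn.Linear(hidden, 2)]
    return nn.Sequential(*layers)

net1 = head(128, 64, with_bn=False)
net1bn = head(128, 64, with_bn=True)
```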

Shuffled data

  • Stratification on target (0: 56%, 1: 44%)
  • ~10% testing set, 70% training set, 20% validation set (a sketch of this split follows the experiment list)
  • exp_name exp_100k_std_bs16_0
  • exp_name exp_100k_std_classw_bs64_net1_0
  • exp_name exp_100k_std_classw_bs64_net1bn_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1bn_0
  • exp_name exp_100k_pssm_rm_std_bs64_net1bn_0
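A minimal sketch of the stratified ~70/20/10 split (variable names and scikit-learn usage are illustrative, not the actual pipeline code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
ids = np.arange(100_000)                               # 100k data points
y = rng.choice([0, 1], size=ids.size, p=[0.56, 0.44])  # target distribution

# Carve out ~10% for testing, then ~20% of the total for validation,
# stratifying on the target at each step.
ids_rest, ids_test, y_rest, y_test = train_test_split(
    ids, y, test_size=0.10, stratify=y, random_state=42)
ids_train, ids_val, y_train, y_val = train_test_split(
    ids_rest, y_rest, test_size=0.20 / 0.90, stratify=y_rest, random_state=42)
```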

[Figure: metrics for the shuffled-data experiments listed above]

  • Removing the pssm feature slightly decreases performance, though not significantly
  • The same holds when batch normalization is applied

Clustering on peptides

  • clustered data on the cl_peptide Dataset (cluster_set_10 of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv)
    • clusters with value 3 are assigned to the test set [%]
    • the rest is shuffled between training and validation, stratifying on target (a sketch of this split follows the experiment list)
  • exp_name exp_100k_std_bs16_cl_peptide_0
  • exp_name exp_100k_std_classw_bs64_net1_cl_peptide_0
  • exp_name exp_100k_std_classw_bs64_net1bn_cl_peptide_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1_cl_peptide_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1bn_cl_peptide_0
  • exp_name exp_100k_pssm_rm_std_bs64_net1bn_cl_peptide_0
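A sketch of how such a cluster-held-out split can be built from the clusters CSV (the target column name is an assumption; the actual pipeline may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv")

test = df[df["cluster_set_10"] == 3]  # cluster 3 held out as the test set
rest = df[df["cluster_set_10"] != 3]  # everything else goes to train/validation
train, val = train_test_split(        # "target" column name assumed for illustration
    rest, test_size=0.2, stratify=rest["target"], random_state=42)
```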

[Figure: metrics for the peptide-clustered experiments listed above]

  • pssm seems less critical than for shuffled data
  • the contribution of batch normalization is less clear-cut

Clustering on (stratified) alleles

  • clustered data on cl_allele Dataset (allele_clustering of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_only_eq_alleleclusters_pseudoseq.csv)
    • clusters with value 1 are assigned to the test set [%]
    • the rest has been shuffled between training and validation, stratifying on target
  • exp_name exp_100k_std_bs16_cl_allele_0
  • exp_name exp_100k_std_classw_bs64_net1_cl_allele_0
  • exp_name exp_100k_std_classw_bs64_net1bn_cl_allele_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1_cl_allele_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1bn_cl_allele_0
  • exp_name exp_100k_pssm_rm_std_bs64_net1bn_cl_allele_0

[Figure: metrics for the allele-clustered experiments listed above]

  • removing pssm seems to let the network learn more
  • batch normalization seems to help
  • the very first configuration (no weighted loss, no batch normalization, smaller batch size, naive GNN) seems to perform as well as the "improved" configuration

Clustering on allele genes (HLA-C held out)

  • clustered data on allele_type Dataset (A, B, C, E)
    • clusters with value C are assigned to the test set
    • the rest has been shuffled between training and validation, stratifying on target
  • exp_name exp_100k_pssm_rm_allele_C_std_classw_gpu_nw16_0

[Figure: metrics for the HLA-C hold-out experiment]

Why does it perform so badly? The test set contains 1218 samples (1% of the data) from cluster C only (A, B and E appear in the training+validation set only):

  • Class 0: 349 samples, 29%
  • Class 1: 869 samples, 71%

It's evident that the network is not learning the physics behind the binding of pMHC complexes. Moreover, the test set is too small to be a fair evaluation of the network, and its proportion of 1s and 0s is very different from the training set's (56/44 vs 29/71).
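To put that imbalance in numbers: a trivial predictor that always outputs class 1 already reaches ~71% accuracy on this test set while its balanced accuracy stays at 50%, so plain accuracy is misleading here (a quick check using the counts above):

```python
# Test-set composition from above: 349 negatives (class 0), 869 positives (class 1).
n0, n1 = 349, 869
majority_accuracy = n1 / (n0 + n1)            # always predict class 1 -> ~0.71
balanced_accuracy = 0.5 * (0 / n0 + n1 / n1)  # mean per-class recall -> 0.50
print(f"{majority_accuracy:.2%}, {balanced_accuracy:.0%}")
```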


Conclusions

  • CNNs and GNNs seem to perform quite similarly, and the differences with sequence-based methods are more evident with clustered data, in particular with allele clusters
    • This may mean that CNNs and GNNs, i.e. structure-based models, generalize better than sequence-based models
    • Since the results when testing on allele C are still not satisfying at all, we can deduce that the model is still learning the data only, and not the physics behind the affinity interactions
