Shuffling+clustering round (best params) using classification #134

Closed
6 tasks done
gcroci2 opened this issue Feb 22, 2023 · 1 comment

gcroci2 commented Feb 22, 2023

After having determined the influence of standardization (#124), the weighted loss function (#126), batch size (#127), and batch normalization (#131), experiment with clustered data, using the choices that gave the best results on shuffled data.
Common features among the following experiments:

  • Data used are in data/pMHCI/features_output_folder/GNN/residue/230329/ (generated in Regenerate hdf5 files for GNNs #140 )
    • Residue-level queries
  • Model is a slightly improved naive GNN (PMHCI_Network01 in src/4_train_models/GNN/I/classification/struct/pmhc_gnn.py)
    • Added a convolutional layer and a linear layer
  • Standardization applied to all features
  • Weighted loss function
  • Batch size 64
  • pssm feature removed
  • Cross entropy loss
  • Adam optimizer
  • Number of epochs increased (70 epochs, min_epoch 45, earlystop_patience 20, earlystop_maxgap 0.06); a sketch of this training setup follows below
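A minimal PyTorch sketch of this shared training configuration. The class weights come from the target distribution reported below (0: 56%, 1: 44%); the model stand-in and the exact semantics of earlystop_maxgap are assumptions for illustration, not the actual pipeline code:

```python
import torch
from torch import nn, optim

# Hypothetical stand-in for PMHCI_Network01 (the real model lives in
# src/4_train_models/GNN/I/classification/struct/pmhc_gnn.py).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# Weighted cross entropy: weights inversely proportional to the class
# frequencies (0: 56%, 1: 44%).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1 / 0.56, 1 / 0.44]))
optimizer = optim.Adam(model.parameters())

MAX_EPOCHS = 70          # number of epochs
MIN_EPOCH = 45           # never stop before this epoch
EARLYSTOP_PATIENCE = 20  # epochs without validation improvement tolerated
EARLYSTOP_MAXGAP = 0.06  # max tolerated train/validation loss gap (assumed meaning)

def should_stop(epoch, epochs_since_best, train_loss, val_loss):
    """Early stopping: past MIN_EPOCH, stop when patience is exhausted
    or the train/validation gap signals overfitting."""
    if epoch < MIN_EPOCH:
        return False
    return (epochs_since_best >= EARLYSTOP_PATIENCE
            or (val_loss - train_loss) > EARLYSTOP_MAXGAP)
```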

Experiments:

  • Plot the best model obtained using shuffled data, and reuse its parameters in all the experiments below
  • Clustering on peptide clusters (cluster 3 is the test set; see pMHCI data exploration - BA quantitative and = only #113 for details)
  • Clustering on alleles (cluster 1 is the test set; see pMHCI data exploration - BA quantitative and = only #113 for details)
  • Clustering on allele genes (HLA-C as test set), only to confirm it doesn't work
  • Compare against the sequence-based model and the best CNN as baselines
  • Show the improvements made since the very beginning (no standardization, no other parameter improvements), for GNNs only

gcroci2 commented Mar 17, 2023

Best configuration experiments

Comparisons

  • The GNN model and configuration are the ones used in exp_100k_pssm_rm_std_bs64_net1bn_[...]; see the comments below for more details
  • MHCFlurry refers to the state-of-the-art sequence-based model, retrained on our data following their guidelines
  • MLP refers to a simple sequence-based multilayer perceptron we created as an additional comparison
  • CNN_ and GNN_ refer to our structure-based models, which we aim to show generalize better than sequence-based ones
    • We picked the best CNN and GNN models developed so far

[Figures: performance comparison of MHCFlurry, MLP, CNN and GNN models on shuffled and clustered data]

  • CNNs and GNNs seem to perform quite similarly
  • The differences between CNNs/GNNs and sequence-based methods are more evident with clustered data, in particular with allele clusters
    • This may mean that CNNs and GNNs, i.e. structure-based models, generalize better than sequence-based models

GNNs in more detail

Legend for experiment names

100k: number of data points
std: standardization on all features is applied
classw: weighted loss function is used
bs: batch size
pssm_rm: pssm feature is removed
net1/net1bn: PMHCI_Network01 is used, either without (net1) or with (net1bn) batch normalization applied after the linear layers; a sketch of the difference follows this legend. If neither term is present, the naive GNN was used.
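As an illustration of the net1 vs net1bn difference (a hypothetical sketch, not the actual PMHCI_Network01 code):

```python
from torch import nn

def head(in_features: int, hidden: int, with_bn: bool) -> nn.Sequential:
    """Final layers of the network, with optional batch normalization."""
    layers = [nn.Linear(in_features, hidden)]
    if with_bn:  # net1bn: batch normalization after the Linear layer
        layers.append(nn.BatchNorm1d(hidden))
    layers += [nn.ReLU(), nn.Linear(hidden, 2)]
    return nn.Sequential(*layers)

net1 = head(128, 64, with_bn=False)
net1bn = head(128, 64, with_bn=True)
```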

Shuffled data

  • Stratification on target (0: 56%, 1: 44%)
  • ~10% testing set, 70% training set, 20% validation set (a sketch of this split follows the experiment list)
  • exp_name exp_100k_std_bs16_0
  • exp_name exp_100k_std_classw_bs64_net1_0
  • exp_name exp_100k_std_classw_bs64_net1bn_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1bn_0
  • exp_name exp_100k_pssm_rm_std_bs64_net1bn_0
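A minimal sketch of the stratified ~70/20/10 split (variable names and scikit-learn usage are illustrative, not the actual pipeline code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
ids = np.arange(100_000)                               # 100k data points
y = rng.choice([0, 1], size=ids.size, p=[0.56, 0.44])  # target distribution

# Carve out ~10% for testing, then ~20% of the total for validation,
# stratifying on the target at each step.
ids_rest, ids_test, y_rest, y_test = train_test_split(
    ids, y, test_size=0.10, stratify=y, random_state=42)
ids_train, ids_val, y_train, y_val = train_test_split(
    ids_rest, y_rest, test_size=0.20 / 0.90, stratify=y_rest, random_state=42)
```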

[Figure: metrics for the shuffled-data experiments listed above]

  • Removing the pssm feature slightly decreases performance, though not significantly
  • The same holds when batch normalization is applied

Clustering on peptides

  • clustered data on the cl_peptide Dataset (cluster_set_10 of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv)
    • clusters with value 3 are assigned to the test set [%]
    • the rest is shuffled between training and validation, stratifying on target (a sketch of this split follows the experiment list)
  • exp_name exp_100k_std_bs16_cl_peptide_0
  • exp_name exp_100k_std_classw_bs64_net1_cl_peptide_0
  • exp_name exp_100k_std_classw_bs64_net1bn_cl_peptide_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1_cl_peptide_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1bn_cl_peptide_0
  • exp_name exp_100k_pssm_rm_std_bs64_net1bn_cl_peptide_0
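A sketch of how such a cluster-held-out split can be built from the clusters CSV (the target column name is an assumption; the actual pipeline may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv")

test = df[df["cluster_set_10"] == 3]  # cluster 3 held out as the test set
rest = df[df["cluster_set_10"] != 3]  # everything else goes to train/validation
train, val = train_test_split(        # "target" column name assumed for illustration
    rest, test_size=0.2, stratify=rest["target"], random_state=42)
```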

[Figure: metrics for the peptide-clustered experiments listed above]

  • pssm seems less critical than for shuffled data
  • the contribution of batch normalization is less clear-cut

Clustering on (stratified) alleles

  • clustered data on cl_allele Dataset (allele_clustering of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_only_eq_alleleclusters_pseudoseq.csv)
    • clusters with value 1 are assigned to the test set [%]
    • the rest has been shuffled between training and validation, stratifying on target
  • exp_name exp_100k_std_bs16_cl_allele_0
  • exp_name exp_100k_std_classw_bs64_net1_cl_allele_0
  • exp_name exp_100k_std_classw_bs64_net1bn_cl_allele_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1_cl_allele_0
  • exp_name exp_100k_pssm_rm_std_classw_bs64_net1bn_cl_allele_0
  • exp_name exp_100k_pssm_rm_std_bs64_net1bn_cl_allele_0

[Figure: metrics for the allele-clustered experiments listed above]

  • removing pssm seems to let the network learn more
  • batch normalization seems to help
  • the very first configuration (no weighted loss, no batch normalization, smaller batch size, naive GNN) seems to perform as well as the "improved" configuration

Clustering on allele genes (HLA-C held out)

  • clustered data on allele_type Dataset (A, B, C, E)
    • clusters with value C are assigned to the test set
    • the rest has been shuffled between training and validation, stratifying on target
  • exp_name exp_100k_pssm_rm_allele_C_std_classw_gpu_nw16_0

[Figure: metrics for the HLA-C hold-out experiment]

Why does it perform so badly? The test set contains 1218 samples (1% of the data) from cluster C only (A, B and E appear in the training+validation set only):

  • Class 0: 349 samples, 29%
  • Class 1: 869 samples, 71%

It's evident that the network is not learning the physics behind the binding of pMHC complexes. Moreover, the test set is too small to be a fair evaluation of the network, and its proportion of 1s and 0s is very different from the training set's (56/44 vs 29/71).
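To put that imbalance in numbers: a trivial predictor that always outputs class 1 already reaches ~71% accuracy on this test set while its balanced accuracy stays at 50%, so plain accuracy is misleading here (a quick check using the counts above):

```python
# Test-set composition from above: 349 negatives (class 0), 869 positives (class 1).
n0, n1 = 349, 869
majority_accuracy = n1 / (n0 + n1)            # always predict class 1 -> ~0.71
balanced_accuracy = 0.5 * (0 / n0 + n1 / n1)  # mean per-class recall -> 0.50
print(f"{majority_accuracy:.2%}, {balanced_accuracy:.0%}")
```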


Conclusions

  • CNNs and GNNs seem to perform quite similarly, and the differences with sequence-based methods are more evident with clustered data, in particular with allele clusters
    • This may mean that CNNs and GNNs, i.e. structure-based models, generalize better than sequence-based models
    • Since the results when testing on allele C are still not satisfying at all, we can deduce that the model is still learning the data only, and not the physics behind the affinity interactions
