Plot again some features' distributions #141

Closed
4 tasks done
gcroci2 opened this issue Mar 14, 2023 · 7 comments

@gcroci2
Collaborator

gcroci2 commented Mar 14, 2023

Once PRs #368 and #333 are merged and issue #346 is solved, and after the hdf5 files have been regenerated (#140), replot the distributions of the features.

  • Replot the feature distributions for the non-one-hot-encoded cases (2/3 features per image). Pay particular attention (1 single plot per image) to:
    • vanderwaals and electrostatic
  • Evaluate which feature distribution is not normal "enough"
  • For the features not normal "enough", evaluate which power transform makes more sense for making their distributions more normal (see here for ideas). Plot the results.
  • add distribution per target value (one plot for 0 and one plot for 1, for each feature)
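
A rough way to flag which feature distributions are not normal "enough" is to look at their skewness and excess kurtosis. The sketch below is only an illustration: the function names and the thresholds are assumptions, not something defined in this issue, and the two samples are synthetic.

```python
import math
import random


def skewness(values):
    """Skewness from central moments: close to 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5  # undefined (division by zero) for constant data


def excess_kurtosis(values):
    """Excess kurtosis: close to 0 for a Gaussian, > 0 for heavier tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def looks_normal_enough(values, max_abs_skew=0.5, max_abs_kurt=1.0):
    # Thresholds are illustrative assumptions, not values from this issue.
    return (
        abs(skewness(values)) <= max_abs_skew
        and abs(excess_kurtosis(values)) <= max_abs_kurt
    )


# Synthetic samples standing in for real features.
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(5000)]
exponential_like = [rng.expovariate(1.0) for _ in range(5000)]

print(looks_normal_enough(gaussian_like))
print(looks_normal_enough(exponential_like))
```

Such a numeric screen only complements the visual inspection of the plots; it does not replace it.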
@gcroci2 gcroci2 created this issue from a note in Development (To do) Mar 14, 2023
@gcroci2 gcroci2 added priority blocked pMHC-I GNNs labels Mar 14, 2023
@gcroci2 gcroci2 removed the blocked label Mar 27, 2023
@gcroci2 gcroci2 moved this from To do to In progress in Development Mar 30, 2023
@joyceljy
Collaborator

joyceljy commented Mar 30, 2023

Features distributions overview

The data used in this issue are those described in issue #140. A more detailed description of all the features and what they represent can be found here.

  • Node features
    • One-hot encoded: polarity (4 channels), res_type (20 channels)
    • Others: bsa, hb_acceptors, hb_donors, info_content, irc_negative_negative, irc_negative_positive, irc_nonpolar_negative, irc_nonpolar_nonpolar, irc_nonpolar_polar, irc_nonpolar_positive, irc_polar_negative, irc_polar_polar, irc_polar_positive, irc_positive_positive, irc_total, res_charge, res_depth, res_mass, res_pI, res_size, sasa, hse (3 channels), pssm (20 channels)
  • Edge features
    • One-hot encoded: covalent, same_chain
    • Others: distance, electrostatic, vanderwaals

Features distributions for non one-hot encoded cases

Features were plotted in groups, several per image, except for vanderwaals and electrostatic, which were plotted individually.
Replotted images can be found here.
except_onehotencoded.zip

Evaluating the feature distributions

Note: one-hot encoded features (polarity, res_type, covalent, same_chain) and pssm were excluded for now from the analysis.

  • We won't normalize or standardize one-hot encoded features. For motivation, see for example this thread (this is the common opinion among the community).
  • In general, the rule of thumb for deciding whether to apply a transformation before standardization is to aim for a distribution that spreads out the feature values, ideally (but not necessarily) resembling a Gaussian. For example, for electrostatic and vanderwaals the cube-root version is better than the original one.
  • Features to which we won't apply log, but standardization directly: res_size, res_charge, hb_donors, hb_acceptors, hse, irc_ features, res_mass, res_pI, distance
  • Features to which we'll apply log(x+1): res_depth, bsa, info_content
  • Features to which we'll apply square root: sasa
  • Features to which we'll apply cube root: electrostatic, vanderwaals
  • We'll remove for now pssm since it's not correctly computed
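
The choices listed above amount to a per-feature transform followed by standardization. A minimal sketch of that idea, using only the standard library; the names `TRANSFORMS`, `cube_root` and `transform_and_standardize` are hypothetical and this is not the project's actual preprocessing code. With NumPy arrays the equivalents would be `np.log1p`, `np.sqrt` and `np.cbrt`.

```python
import math
import statistics


def cube_root(x):
    # Real cube root that keeps the sign, so negative values are allowed.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


# Transform per feature, following the choices listed above.
# Features not listed here are standardized directly.
TRANSFORMS = {
    "res_depth": math.log1p,  # log(x + 1)
    "bsa": math.log1p,
    "info_content": math.log1p,
    "sasa": math.sqrt,
    "electrostatic": cube_root,
    "vanderwaals": cube_root,
}


def transform_and_standardize(name, values):
    """Apply the feature's transform (identity if none), then z-score."""
    f = TRANSFORMS.get(name, lambda x: x)
    transformed = [f(v) for v in values]
    mean = statistics.fmean(transformed)
    std = statistics.pstdev(transformed)
    if std == 0:
        # Constant feature: nothing to scale.
        return [0.0 for _ in transformed]
    return [(t - mean) / std for t in transformed]


# Toy values, not real data.
sasa_z = transform_and_standardize("sasa", [0.0, 4.0, 16.0, 36.0])
elec_z = transform_and_standardize("electrostatic", [-8.0, -1.0, 0.0, 1.0, 8.0])
```

Applying the transform before computing the mean and std matters: standardizing first would make log and square root undefined for the negative z-scores.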

TODOs
Considering all features distributions except for res_size, res_charge, hb_donors, hb_acceptors, hse (they're already fine in their original distributions), do:

  • Only for features with values > 0 (no 0s, no negative values)
    • Replot the distributions using log(x)
    • Replot the distributions using log(x+1)
    • Replot the distributions using np.sqrt(x)
  • Only for features with values >= 0 (there are 0s, but no negative values)
    • Replot the distributions using log(x+1)
    • Replot the distributions using np.sqrt(x)
  • Only for features with values of any sign (there are 0s, negative and positive values)
    • Replot the distributions using log(exp(x)+1) (we were still getting infinite values so we gave up on this)
    • Replot the distributions using cube root
  • Only for bsa and sasa, try to plot log(log(x+1)+1)
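
The domain rules in the list above can be sketched as a small helper that only offers transforms defined on the whole value range. The sketch also includes an overflow-safe form of log(exp(x)+1): computed naively, exp(x) exceeds the float64 range for x above roughly 709, which is one possible (not confirmed) source of infinite values. All function names here are hypothetical.

```python
import math


def softplus(x):
    """log(exp(x) + 1) computed without overflow for large |x|."""
    # Identity: log(1 + e^x) = max(x, 0) + log(1 + e^(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cube_root(x):
    # Sign-preserving real cube root, defined for any real x.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def candidate_transforms(values):
    """Return the transforms that are defined for every value in `values`."""
    lowest = min(values)
    if lowest > 0:  # strictly positive: no 0s, no negatives
        return {"log(x)": math.log, "log(x+1)": math.log1p, "sqrt(x)": math.sqrt}
    if lowest >= 0:  # 0s present, no negatives
        return {"log(x+1)": math.log1p, "sqrt(x)": math.sqrt}
    # 0s, negative and positive values
    return {"log(exp(x)+1)": softplus, "cube root": cube_root}


print(sorted(candidate_transforms([0.5, 2.0, 7.0])))
print(sorted(candidate_transforms([0.0, 2.0, 7.0])))
print(sorted(candidate_transforms([-3.0, 0.0, 7.0])))
```

Note that the stable form stays finite but is close to the identity for large positive x, so it does little to compress a long right tail.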

Features considered to have a Gaussian-like distribution (skewed)

Plot panels per feature (images not reproduced here):

  • electrostatic: Original, Yeo-Johnson, Cube root, Binary
  • res_size: Original, Yeo-Johnson, Log, Binary
  • res_charge: Original, Yeo-Johnson, Log, Binary
  • vanderwaals: Original, Zoomed, Yeo-Johnson, Cube root, Binary
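
For reference, the Yeo-Johnson transform named in these panels extends the Box-Cox power transform to zero and negative values. The sketch below applies its standard definition for a fixed lambda. In practice lambda is usually fitted by maximum likelihood (for example by scikit-learn's `PowerTransformer(method="yeo-johnson")`); which implementation produced the plots in this issue is not stated, so treat this purely as an illustration.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)  # limit for lambda -> 0
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        return -math.log1p(-x)  # limit for lambda -> 2
    return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# lambda = 1 leaves the data unchanged; lambda < 1 compresses the right
# tail of non-negative values.
values = [-3.0, 0.0, 3.0, 30.0]
print([yeo_johnson(v, 1.0) for v in values])
print([round(yeo_johnson(v, 0.0), 3) for v in values])
```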

Features considered to have a Gaussian-like distribution (exponential)

Plot panels per feature (images not reproduced here):

  • bsa: Original, Yeo-Johnson, Square, Log(x+1), Log(log(x+1)+1), Binary
  • hb_donors: Original, Yeo-Johnson, Log, Binary
  • irc_nonpolar_negative: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_nonpolar_nonpolar: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_nonpolar_polar: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_nonpolar_positive: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_polar_polar: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_polar_positive: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_total: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_negative_positive: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_positive_positive: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_polar_negative: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • irc_negative_negative: Original, Yeo-Johnson, Square, Log(x+1), Binary
  • res_depth: Original, Yeo-Johnson, Log, Log(x+1), Square, Binary
  • sasa: Original, Yeo-Johnson, Square, Log(x+1), Log(log(x+1)+1), Binary

Features whose distribution is not Gaussian-like

Plot panels per feature (images not reproduced here):

  • hb_acceptors: Original, Binary
  • hse: Original, Binary (hse_0), Binary (hse_1), Binary (hse_2)
  • info_content: Original, Log(x+1), Square, Binary
  • res_mass: Original, Log, Log(x+1), Square, Binary
  • res_pI: Original, Log, Log(x+1), Square, Binary
  • distance: Original, Log, Log(x+1), Square, Binary

@gcroci2 gcroci2 added the meeting label Apr 5, 2023
@LilySnow
Collaborator

LilySnow commented Apr 5, 2023

The Binary panels always have the same mean and std for binders and non-binders... is it a bug?

@gcroci2
Collaborator Author

gcroci2 commented Apr 6, 2023

> The Binary panels always have the same mean and std for binders and non-binders... is it a bug?

Good catch! We'll check it and, if needed, post the corrected plots. The distributions are right though, so at most the mean and std values will change.
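
For reference, per-target statistics are obtained by splitting the feature values on the binary target before computing the mean and std. The sketch below shows that grouping with hypothetical names and toy values; it is not the code used for the plots, and the cause of the identical values reported above is not stated in this thread.

```python
import statistics


def stats_per_target(values, targets):
    """Mean and std of a feature, computed separately for each target value."""
    groups = {}
    for value, target in zip(values, targets):
        groups.setdefault(target, []).append(value)
    return {
        target: (statistics.fmean(group), statistics.pstdev(group))
        for target, group in groups.items()
    }


# Toy data: 0 = non-binder, 1 = binder.
feature = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]
binder = [0, 0, 0, 1, 1, 1]
summary = stats_per_target(feature, binder)
for target, (mean, std) in sorted(summary.items()):
    print(target, round(mean, 3), round(std, 3))
```

If both groups report exactly the same mean and std on real data, a quick check is whether the statistics were computed on the pooled values instead of on each group.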

@gcroci2
Collaborator Author

gcroci2 commented Apr 12, 2023

The means and std devs for the binary plots are now correct @LilySnow

@joyceljy
Collaborator

Conclusion:
Features suitable for log(x) transformation:
None

Features suitable for Yeo-Johnson transformation:
res_size

Features suitable for log(x) transformation:
res_charge, res_depth

Features suitable for log(x+1) transformation:
bsa

Features suitable for Square root transformation:
sasa

Features suitable for Cube root transformation:
electrostatic, vanderwaals

Original Distribution is already good:
res_size, res_charge, hb_donors, hb_acceptors, hse, irc_nonpolar_negative, irc_nonpolar_nonpolar, irc_nonpolar_polar, irc_nonpolar_positive, irc_polar_polar, irc_polar_positive, irc_total, irc_negative_positive, irc_positive_positive, irc_polar_negative, irc_negative_negative

@gcroci2
Collaborator Author

gcroci2 commented Apr 18, 2023

> Conclusion: Features suitable for log(x) transformation: None
>
> Features suitable for Yeo-Johnson transformation: res_size

We'll keep the original distribution for res_size.

> Features suitable for log(x) transformation: res_charge, res_depth

We'll keep the original distribution for res_charge, while I think that log(x+1) works better for res_depth.

> Features suitable for log(x+1) transformation: bsa

We'll try log(log(x+1)+1) for it.

> Features suitable for Square root transformation: sasa

We'll try log(log(x+1)+1) for it as well.

> Features suitable for Cube root transformation: electrostatic, vanderwaals

Agree.

> Original Distribution is already good: res_size, res_charge, hb_donors, hb_acceptors, hse, irc_nonpolar_negative, irc_nonpolar_nonpolar, irc_nonpolar_polar, irc_nonpolar_positive, irc_polar_polar, irc_polar_positive, irc_total, irc_negative_positive, irc_positive_positive, irc_polar_negative, irc_negative_negative

Agree.

@joyceljy
Collaborator

joyceljy commented Apr 18, 2023

Update:
I think applying log(log(x+1)+1) to bsa and sasa doesn't change the distributions much. For sasa, it even introduces another high peak. For bsa, only the x-axis range changes. So I suggest keeping log(x+1) for bsa and the square root transformation for sasa.
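
The comparison above was made by eye on the plots. As a rough numeric complement, candidate transforms can be ranked by how close to zero they bring the skewness. The sketch below does this on a synthetic log-normal sample standing in for a right-skewed, non-negative feature; it does not use the bsa or sasa data, and skewness alone would not detect an extra peak like the one mentioned for sasa.

```python
import math
import random


def skewness(values):
    """Skewness from central moments: close to 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Candidate transforms for non-negative values.
CANDIDATES = {
    "original": lambda x: x,
    "sqrt(x)": math.sqrt,
    "log(x+1)": math.log1p,
    "log(log(x+1)+1)": lambda x: math.log1p(math.log1p(x)),
}

# Synthetic right-skewed sample (assumption: not real feature values).
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

skews = {
    name: skewness([f(v) for v in sample]) for name, f in CANDIDATES.items()
}
best = min(skews, key=lambda name: abs(skews[name]))

for name, value in skews.items():
    print(f"{name}: skewness = {value:.3f}")
print("closest to symmetric:", best)
```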

Conclusion:
Features suitable for log(x) transformation:
None

Features suitable for Yeo-Johnson transformation:
None

Features suitable for log(x+1) transformation:
bsa, res_depth

Features suitable for Square root transformation:
sasa

Features suitable for Cube root transformation:
electrostatic, vanderwaals

Original Distribution is already good:
res_size, res_charge, hb_donors, hb_acceptors, hse, irc_nonpolar_negative, irc_nonpolar_nonpolar, irc_nonpolar_polar, irc_nonpolar_positive, irc_polar_polar, irc_polar_positive, irc_total, irc_negative_positive, irc_positive_positive, irc_polar_negative, irc_negative_negative

@gcroci2 gcroci2 removed the priority label Apr 26, 2023
@gcroci2 gcroci2 removed the meeting label May 31, 2023
@gcroci2 gcroci2 closed this as completed May 31, 2023
Development automation moved this from In progress to Done May 31, 2023