feat: add hist module for plotting raw hdf5 files features distributions #261
This PR is stale because it has been open for 14 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.
Really nice! It is very useful to be able to see the values/distributions of the features. The next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?
Regarding my comment about _check_features: I don't know whether it is worthwhile dealing with/fixing this now, since it depends on whether we will keep using hdf5 files or not.
Whichever way you decide to go on it is fine, but if you don't change it now, maybe open a new issue about it so we don't forget?
BTW, I haven't tested what the DataFrame and histogram outputs look like, but I assume you have and that they look as intended. Should I take a closer look at that as well?
I renamed transform.py to hist.py (it converts the hdf5 files to pandas for plotting and saving histograms), and did the same for the test script. Once we have a clearer pipeline for the transformations we can improve the naming again.
I plotted the hists for all the features of the 140k datapoints and they look good, so don't worry :)
Don't forget to rename the PR as well :)
I think for the one-hot encoded values, it would be more useful to plot a single histogram that shows how many of each is present, rather than separate 0-1 histograms for each option. |
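To make the suggestion above concrete, here is a minimal sketch of collapsing one-hot columns into a single set of per-category counts. The category names and rows are made up for illustration and are not taken from this PR or from the hdf5 files; it also assumes every row has exactly one 1 among the one-hot columns.

```python
# Hypothetical one-hot encoded feature: one column per category,
# exactly one 1 per row. Names and values are illustrative only.
categories = ["polar", "apolar", "charged"]
rows = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]


def one_hot_counts(rows, categories):
    """Sum each one-hot column to get the number of rows per category."""
    totals = [sum(column) for column in zip(*rows)]
    return dict(zip(categories, totals))


counts = one_hot_counts(rows, categories)

# Text rendering of the single combined histogram: one bar per category.
for name, count in counts.items():
    print(f"{name:<8} {'#' * count} ({count})")
```

With a pandas DataFrame, the same counts could presumably be obtained by summing the one-hot columns (for example `df[one_hot_columns].sum()`, where `one_hot_columns` is a hypothetical list of column names) and drawing one bar plot from the result.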
Opened in #317 :) |
The next step will be to implement the actual transformations in the DeeprankDataset objects, see #237