Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add hist module for plotting raw hdf5 files features distributions #261

Merged
merged 28 commits into from Jan 16, 2023

Conversation

gcroci2
Copy link
Collaborator

@gcroci2 gcroci2 commented Nov 29, 2022

Next step will be to implement the actual trasformations in the DeeprankDataset objects, see #237

@gcroci2 gcroci2 linked an issue Nov 29, 2022 that may be closed by this pull request
@gcroci2 gcroci2 self-assigned this Nov 29, 2022
@gcroci2 gcroci2 marked this pull request as draft November 29, 2022 08:24
@github-actions
Copy link

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale issue not touched from too much time label Dec 30, 2022
@gcroci2 gcroci2 removed the stale issue not touched from too much time label Dec 30, 2022
@gcroci2 gcroci2 marked this pull request as ready for review January 12, 2023 14:14
@gcroci2 gcroci2 changed the title feat: add transformation module for processing raw hdf5 files data feat: add transformation module for plotting raw hdf5 files data Jan 12, 2023
Copy link
Collaborator

@DaniBodor DaniBodor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice! This is really useful to be able to see the values/distributions of the features. Next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?

Regarding my comment about _check_features: don't know whether it is worthwhile dealing with/fixing this now, depending on how we will proceed with using hdf5 files or not,
Whichever way you decide to go on it is fine, but if you don't change it now, maybe open a new issue about it so we don't forget?

deeprankcore/tools/transform.py Outdated Show resolved Hide resolved
deeprankcore/tools/transform.py Outdated Show resolved Hide resolved
deeprankcore/tools/transform.py Outdated Show resolved Hide resolved
@DaniBodor
Copy link
Collaborator

BTW, I haven't tested what the Dataframe and histogram outputs look like, but I assume you have and it looks as intended. Should I take a closer look at that as well?

@gcroci2
Copy link
Collaborator Author

gcroci2 commented Jan 13, 2023

Really nice! This is really useful to be able to see the values/distributions of the features. Next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?

I renamed transform.py to hist.py (it converts the hdf5 files to pandas for plotting and saving histograms), and the same for the test script. When we'll have a clearer pipeline for the transformations we can improve the naming again.

@gcroci2
Copy link
Collaborator Author

gcroci2 commented Jan 13, 2023

BTW, I haven't tested what the Dataframe and histogram outputs look like, but I assume you have and it looks as intended. Should I take a closer look at that as well?

I plotted the hists for all the features of the 140k datapoints and they look good, so don't worry :)

@DaniBodor
Copy link
Collaborator

Really nice! This is really useful to be able to see the values/distributions of the features. Next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?

I renamed transform.py to hist.py (it converts the hdf5 files to pandas for plotting and saving histograms), and the same for the test script. When we'll have a clearer pipeline for the transformations we can improve the naming again.

don't forget to rename the PR as well :)

@DaniBodor
Copy link
Collaborator

I think for the one-hot encoded values, it would be more useful to plot a single histogram that shows how many of each is present, rather than separate 0-1 histograms for each option.
This can also be a separate issue that we can tackle when we have time for it.

@gcroci2 gcroci2 changed the title feat: add transformation module for plotting raw hdf5 files data feat: add hist module for plotting raw hdf5 files features distributions Jan 13, 2023
@gcroci2
Copy link
Collaborator Author

gcroci2 commented Jan 13, 2023

I think for the one-hot encoded values, it would be more useful to plot a single histogram that shows how many of each is present, rather than separate 0-1 histograms for each option. This can also be a separate issue that we can tackle when we have time for it.

Opened in #317 :)

@gcroci2 gcroci2 merged commit ba69ca5 into main Jan 16, 2023
@gcroci2 gcroci2 deleted the 237_transformation_module_gcroci2 branch January 16, 2023 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants