feat: add hist module for plotting raw hdf5 files features distributions #261

gcroci2 · 2022-11-29T08:22:54Z

The next step will be to implement the actual transformations in the DeeprankDataset objects; see #237.

github-actions · 2022-12-30T03:24:50Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

DaniBodor

Really nice! This is really useful to be able to see the values/distributions of the features. Next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?

Regarding my comment about _check_features: I don't know whether it is worth fixing this now, since that depends on whether we keep using hdf5 files.
Whichever way you decide to go is fine, but if you don't change it now, maybe open a new issue about it so we don't forget?

deeprankcore/tools/transform.py

DaniBodor · 2023-01-13T10:31:30Z

BTW, I haven't tested what the Dataframe and histogram outputs look like, but I assume you have and it looks as intended. Should I take a closer look at that as well?

gcroci2 · 2023-01-13T11:17:13Z

> Really nice! This is really useful to be able to see the values/distributions of the features. Next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?

I renamed transform.py to hist.py (it converts the hdf5 files to pandas for plotting and saving histograms), and the same for the test script. When we have a clearer pipeline for the transformations, we can improve the naming again.
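For readers who want a concrete picture of the "hdf5 to pandas to histograms" flow described here, the following is a minimal sketch and not the code merged in this PR. The function names `features_to_dataframe` and `save_histograms`, the group name `node_features`, the feature names, and the file layout (`/<entry_id>/<group>/<feature>` holding equal-length 1D arrays) are all assumptions made for illustration.

```python
import os
import tempfile

import h5py
import matplotlib
matplotlib.use("Agg")  # headless backend: write image files, never open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def features_to_dataframe(hdf5_path, group_name):
    """Collect per-entry feature arrays into one long DataFrame.

    Assumed (hypothetical) layout: /<entry_id>/<group_name>/<feature>,
    where every feature of an entry is a 1D array of the same length.
    """
    frames = []
    with h5py.File(hdf5_path, "r") as f:
        for entry_id, entry in f.items():
            group = entry[group_name]
            # group[name][()] reads the whole dataset into a NumPy array
            frame = pd.DataFrame({name: group[name][()] for name in group})
            frame.insert(0, "id", entry_id)
            frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def save_histograms(df, features, out_path, bins=20):
    """Draw one histogram per feature, side by side, and save the figure."""
    fig, axes = plt.subplots(
        1, len(features), figsize=(4 * len(features), 3), squeeze=False
    )
    for ax, name in zip(axes[0], features):
        ax.hist(df[name].to_numpy(), bins=bins)
        ax.set_title(name)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Build a toy file so the sketch runs without real data.
tmp_dir = tempfile.mkdtemp()
h5_path = os.path.join(tmp_dir, "toy.hdf5")
rng = np.random.default_rng(0)
with h5py.File(h5_path, "w") as f:
    for entry_id, n_nodes in [("entry_a", 5), ("entry_b", 3)]:
        grp = f.create_group(f"{entry_id}/node_features")
        grp.create_dataset("charge", data=rng.normal(size=n_nodes))
        grp.create_dataset("size", data=rng.integers(1, 10, size=n_nodes))

df = features_to_dataframe(h5_path, "node_features")
png_path = os.path.join(tmp_dir, "hist.png")
save_histograms(df, ["charge", "size"], png_path)
print(df.shape)
```

Building one long DataFrame first keeps the plotting step independent of the file layout, so the same table could also be written to a feather or parquet file, as some of the commits in this PR mention.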

gcroci2 · 2023-01-13T11:29:24Z

> BTW, I haven't tested what the Dataframe and histogram outputs look like, but I assume you have and it looks as intended. Should I take a closer look at that as well?

I plotted the hists for all the features of the 140k datapoints and they look good, so don't worry :)

DaniBodor · 2023-01-13T12:08:53Z

> > Really nice! This is really useful to be able to see the values/distributions of the features. Next step will be actually implementing the transformation. Maybe this module should be renamed, because (at least for now) it is not actually transforming anything, just visualizing; would data_visualization.py be a good name?
>
> I renamed transform.py to hist.py (it converts the hdf5 files to pandas for plotting and saving histograms), and the same for the test script. When we have a clearer pipeline for the transformations, we can improve the naming again.

don't forget to rename the PR as well :)

DaniBodor · 2023-01-13T12:10:32Z

I think for the one-hot encoded values, it would be more useful to plot a single histogram that shows how many of each is present, rather than separate 0-1 histograms for each option.
This can also be a separate issue that we can tackle when we have time for it.

gcroci2 · 2023-01-13T17:04:25Z

> I think for the one-hot encoded values, it would be more useful to plot a single histogram that shows how many of each is present, rather than separate 0-1 histograms for each option. This can also be a separate issue that we can tackle when we have time for it.

Opened in #317 :)
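To illustrate the suggestion tracked in #317, here is a small sketch of collapsing a one-hot encoded feature into a single bar chart of category counts. It is only an illustration under stated assumptions, not the approach eventually implemented: the helper names `one_hot_counts` and `save_category_histogram`, the feature name `res_type`, and the three category labels are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend: write image files, never open a window
import matplotlib.pyplot as plt
import numpy as np


def one_hot_counts(one_hot, labels):
    """Count how many rows select each category of a one-hot matrix."""
    one_hot = np.asarray(one_hot)
    if one_hot.ndim != 2 or one_hot.shape[1] != len(labels):
        raise ValueError("expected a 2D array with one column per label")
    # Summing each 0/1 column gives the number of rows in that category.
    counts = one_hot.sum(axis=0)
    return {label: int(n) for label, n in zip(labels, counts)}


def save_category_histogram(counts, title, out_path):
    """Draw one bar per category instead of one 0-1 histogram per column."""
    fig, ax = plt.subplots()
    ax.bar(list(counts), list(counts.values()))
    ax.set_title(title)
    ax.set_ylabel("count")
    fig.savefig(out_path)
    plt.close(fig)


# Hypothetical feature with three categories and four data points.
labels = ["ALA", "GLY", "SER"]
one_hot = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
counts = one_hot_counts(one_hot, labels)
out_path = os.path.join(tempfile.mkdtemp(), "res_type.png")
save_category_histogram(counts, "res_type", out_path)
print(counts)  # {'ALA': 2, 'GLY': 1, 'SER': 1}
```

A single chart like this makes class imbalance visible at a glance, which is harder to read off a set of separate 0-1 histograms.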

gcroci2 added 3 commits November 29, 2022 09:20

update minor in dataset.py doc string

bd2fe14

add notebook for development (to be deleted later)

60ab4e0

add first draft of transform module

5ce49a8

gcroci2 linked an issue Nov 29, 2022 that may be closed by this pull request

Create basic transformations module #237

Closed

gcroci2 self-assigned this Nov 29, 2022

gcroci2 marked this pull request as draft November 29, 2022 08:24

gcroci2 added 8 commits December 12, 2022 18:08

Merge branch 'main' into 237_transformation_module_gcroci2

9ce434c

Merge branch 'main' into 237_transformation_module_gcroci2

63a9b3c

add multiple hdf5 files option to hdf5_to_pandas

9d512da

modify how pandas df is built for save it in a feather/parquet file f…

4a14621

…ormat

add my comment to delete later

5737189

add reset index to pandas df in exporters

bf1b3d9

improve logic in hdf5_to_pandas

376804a

update development notebook

8981ee2

github-actions bot added the stale issue not touched from too much time label Dec 30, 2022

gcroci2 removed the stale issue not touched from too much time label Dec 30, 2022

gcroci2 and others added 11 commits January 3, 2023 10:32

add utility functions

71b0402

add dependencies

988524b

add tests for new functions in transform.py

9ba66ff

delete notebook

d54512d

improve plotting function

22223df

update tests

d499ddc

fix prospector errors

3401460

Merge branch 'main' into 237_transformation_module_gcroci2

24d2b1f

change plotly with matplot lib to handle big data

8583aca

update tests

ef57d57

remove warning too verbose

6877185

gcroci2 removed a link to an issue Jan 12, 2023

Create basic transformations module #237

Closed

remove plotly dependencies

3d32bee

merge with main

4bf8974

gcroci2 marked this pull request as ready for review January 12, 2023 14:14

gcroci2 changed the title ~~feat: add transformation module for processing raw hdf5 files data~~ feat: add transformation module for plotting raw hdf5 files data Jan 12, 2023

gcroci2 requested a review from DaniBodor January 12, 2023 14:15

DaniBodor approved these changes Jan 13, 2023

Review comments on deeprankcore/tools/transform.py: 3 threads (outdated, resolved)

renaming scripts

b2107c5

gcroci2 added 2 commits January 13, 2023 12:20

uniform docstring to google style

713932e

add details to save_hist docstring

fbcd6c6

gcroci2 changed the title ~~feat: add transformation module for plotting raw hdf5 files data~~ feat: add hist module for plotting raw hdf5 files features distributions Jan 13, 2023

change name in tests for hist module

2d29bdc

gcroci2 merged commit ba69ca5 into main Jan 16, 2023

gcroci2 deleted the 237_transformation_module_gcroci2 branch January 16, 2023 08:41
