Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ehr transformer ICU example #245

Merged
merged 53 commits into from Jan 13, 2023
Merged

Ehr transformer ICU example #245

merged 53 commits into from Jan 13, 2023

Conversation

ellabarkan
Copy link
Collaborator

@ellabarkan ellabarkan commented Jan 8, 2023

.

Copy link
Collaborator

@mosheraboh mosheraboh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ella.
I've reviewed the files you mentioned (didn't get into all the details though).
Looks good, please see few comments inline

from fuse.utils.ndict import NDict


class OpReadDataframeCinC(OpBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use OpReadDataFrame from fuse as is? What do you think?
If we should process the dataframe, we can do it before we pass is to OpReadDataframe

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following your suggestion, I moved the creation of the patients dataframe to the "loading stage" and using OpReadDataFrame now.

from ops_read_cinc import OpReadDataframeCinC
from fuse.data.utils.export import ExportDataset

SOURCE = r"C:/D_Drive/Projects/EHR_Transformer/PhysioNet/predicting-mortality-of-icu-patients-the-physionetcomputing-in-cardiology-challenge-2012-1.0.0/predicting-mortality-of-icu-patients-the-physionet-computing-in-cardiology-challenge-2012-1.0.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to read the path from env variable and provide instructions on how to download the data

Copy link
Collaborator Author

@ellabarkan ellabarkan Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated configuration file to read from the environment variable and the main in dataset.py



class OpAddBMI(OpBase):
def __call__(self, sample_dict) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant to all ops: if you think that the op might be useful for other datasets as well (and also for readability).
It's better to also get the input keys and output keys. in this case key_in_height, key_in_weight, key_out_bmi.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added parameters of input and output keys

d_static = sample_dict["StaticDetails"]

if ("Height" in d_static.keys()) & ("Weight" in d_static.keys()):
height = d_static["Height"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect specific units, maybe we should specify it in call comments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added information in the comment , please check if it in the right place


d_visits_sentences = sample_dict["VisitSentences"]
n_visits = len(d_visits_sentences)
# TODO add filtering patients with small number of visits - add to configuration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put this op in static_pipeline, you can return None in order to filter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the filtering patients on the number of visits to the raw data loading method together with filtering based on time (less than X hours in the hospital)


d_static = sample_dict["StaticDetails"]

if ("Height" in d_static.keys()) & ("Weight" in d_static.keys()):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not? is it a sample you want to keep or not?
If yes, maybe you should add -1 as a dummy value?

Copy link
Collaborator Author

@ellabarkan ellabarkan Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to add BMI with -1 to the dict of static values in case it can't be calculated from missing Height or Weight.
It will be further converted to nan during percentile calculations and will be skipped in the next steps of trajectories build.

d_static = sample_dict["StaticDetails"]

if ("Height" in d_static.keys()) & ("Weight" in d_static.keys()):
height = d_static["Height"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's ndict, you can also access it as sample_dict["StaticDetails.Height"]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


d_static = sample_dict["StaticDetails"]

if ("Height" in d_static.keys()) & ("Weight" in d_static.keys()):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Height" in d_static
will also work

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

# convert continuous measurements to categorical ones based on defined percentiles

# mapping static clinical characteristics (Age, Gender, ICU type, Height, etc)
for k in sample_dict["StaticDetails"].keys():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for k in sample_dict["StaticDetails"]:
Will also work

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed



class OpMapToCategorical(OpBase):
def __call__(self, sample_dict, percentiles: dict) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • maybe rename percentiles to bins here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in all places in the code?

@mosheraboh
Copy link
Collaborator

This example is based on Vadim (@floccinauc) research.
@ellabarkan implemented the dataset and tokenizer, and I implemented the model and training script.
@michalozeryflato @SagiPolaczek , I'm merging now cause we need it for Sunday.
But please review and add comments/questions to anyone of us.

@mosheraboh
Copy link
Collaborator

@ellabarkan , we still need to create a unittest. Can you please work on it in a new PR next week?

Copy link
Collaborator

@SagiPolaczek SagiPolaczek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some delay I did a quick review and added few small comments inline 😄

The code looks nice and clean! I think a small unittest can be useful here to make sure the example won't break in the future.

aux_next_vis_classification: ${aux_next_vis_classification}
next_vis_loss_weight: 0.1

# uncomment to track in clearml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a better practice to use flags. for example:

config:

  clearml:
    track: 1 # 0 or 1
    project_name: "ehr_transformer"
    task_name: ${name}
    tags: "fuse_example"
    reuse_last_task_id: True
    continue_last_task: False

and in the code:

if cfg.clearml.track:
    # blablabla

# continue_last_task: False


# uncomment for SGD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like the above :)

STATIC_FIELDS = ["Age", "Gender", "Height", "ICUType", "Weight"]


class OpAddBMI(OpBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool :)

Suggestion:
if you think its useful - we can start an ops_clincal.py file and store it over there for future reuse (might apply also for some UKBB ops)

:param track_clearml: optional - to track with clearml provide arguments to start_clearml_logger()
"""

if track_clearml is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where we can fit the flag (as a continuation for the previous comment)

@@ -521,3 +521,16 @@ def __call__(self, sample_dict: NDict, key: str, value: Any) -> Union[None, dict
"""
sample_dict[key] = value
return sample_dict


class OpSetIfNotExist(OpBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add it to the ops list in the data/REAME in a future PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants