
Conversation

h-mayorquin
Contributor

Hi,
Currently, we have a check to determine whether the timestamps associated with the data exhibit discontinuities. If they do, we raise an assertion error and prevent the file from being read. I believe this is beneficial, as it serves as a strong signal that the user should take a closer look at the file. The assertion is warranted.

But I also believe that once we have informed the users that there is an error, we should still allow them to open the file if they so desire. This is good for two reasons:

  1. Users can use the machinery of neo to find out what might have gone wrong, which would be a valuable service.
  2. Sometimes there is no other way: the person who ran the experiments is no longer available, but people might still want to analyze the data. I think this is a use case that should be covered.

This PR allows this to happen by introducing another boolean at instantiation that bypasses the error and opens the file anyway. That is, it enables the desired behavior with a boolean and keeps the default as it is for backwards compatibility.
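Concretely, the proposed behavior amounts to something like the following sketch. All names here (the class, the flag, the attribute) are illustrative placeholders for the design discussed in this thread, not neo's actual API:

```python
class ExampleReader:
    """Illustrative sketch of the bypass flag proposed in this PR.

    Class, flag, and attribute names are placeholders, not neo's API.
    """

    def __init__(self, filename, strict_timestamp_check=True):
        self.filename = filename
        self.discontinuous_timestamps = False
        timestamps = self._read_timestamps()
        # Continuous Intan timestamps increase by exactly one per sample.
        if any(b - a != 1 for a, b in zip(timestamps, timestamps[1:])):
            if strict_timestamp_check:
                # Default: refuse to read, preserving current behavior.
                raise ValueError(
                    f"Timestamps in {filename} are discontinuous; "
                    "pass strict_timestamp_check=False to load anyway."
                )
            # Opt-in path: record the problem and keep going.
            self.discontinuous_timestamps = True

    def _read_timestamps(self):
        # Placeholder: a real reader would parse these from the file.
        return [0, 1, 2, 10, 11]
```

The default keeps the strict behavior, so existing code is unaffected; only users who explicitly opt out of the check reach the degraded-data path.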

@zm711
Contributor

zm711 commented May 6, 2024

@h-mayorquin, two comments: one a suggestion, and one a more philosophical discussion point.

I think I would raise a warning even if someone set `strict_mode_for_timestamps=False`, so they know which datasets have errors and which ones don't. I think that if the user decides to switch that argument to False, we still need to distinguish between "good" and "corrupted" datasets (but in this case it would be a warning instead of an error).

My only worry is that, if we think of the case of a block of data (in the Intan sense, not the neo.Block sense), a discontinuity could mean the whole dataset is scrambled, because all packets after that point are messed up. (I think; I don't know this empirically.) Is the user choice valuable enough to allow them to play around with potentially truly messed-up data? (i.e., how many users could truly go block by block and salvage some data, versus opening a file and trying to process it like normal despite it needing some post-curation?)

I'm really torn on this, but I'm willing to be convinced. (I've had one corrupted file due to another user of our shared computer; I thought about trying to salvage it, but decided I didn't want to expend the effort.)

@h-mayorquin
Contributor Author

h-mayorquin commented May 6, 2024

> I think I would raise a warning even if someone set `strict_mode_for_timestamps=False`, so they know which datasets have errors and which ones don't. I think that if the user decides to switch that argument to False, we still need to distinguish between "good" and "corrupted" datasets (but in this case it would be a warning instead of an error).

I did it like this at the beginning and then changed it. If they switch the default to False, it means they know what they need to know. A warning should be actionable and should tell you what to do to avoid it. If we added a warning, what would be the action to take to remove it?

> My only worry is that, if we think of the case of a block of data (in the Intan sense, not the neo.Block sense), a discontinuity could mean the whole dataset is scrambled, because all packets after that point are messed up. (I think; I don't know this empirically.) Is the user choice valuable enough to allow them to play around with potentially truly messed-up data? (i.e., how many users could truly go block by block and salvage some data, versus opening a file and trying to process it like normal despite it needing some post-curation?)

Yes: we told them there is a problem and stopped them in their tracks, then gave them a big bool so they can go on to the forbidden land of their own free will, with no guarantees on our part. Why play the police?

As I don't want to play the police, I think the improvement would be to make it clearer that they are going to a dangerous place. Maybe the variable should be `unsafe_loading=False` or `load_data_even_if_corrupted`. Something to make it clear that "don't use this to load data in bulk, as there might be dragons".

@zm711
Contributor

zm711 commented May 6, 2024

> I did it like this at the beginning and then changed it. If they switch the default to False, it means they know what they need to know. A warning should be actionable and should tell you what to do to avoid it. If we added a warning, what would be the action to take to remove it?

Maybe instead we could add an attribute or a header element that provides a record of whether the file is corrupted or uncorrupted. Someone might globally take on the risk, but I think we should still give them a way to know that this particular dataset is corrupt. So a header element is probably easiest and most visible.

> Maybe the variable should be `unsafe_loading=False` or `load_data_even_if_corrupted`. Something to make it clear that "don't use this to load data in bulk, as there might be dragons".

I think this is a good point. The name timestamps doesn't quite convey the potential gravity for the dataset. Since we also get this error in some legitimate cases, I don't know that `load_data_even_if_corrupted` is fair either. `unsafe_loading` might be fair, with an explanation that it doesn't check for proper timestamps, which in Intan are a proxy for data integrity.

@h-mayorquin
Contributor Author

> Maybe instead we could add an attribute or a header element that provides a record of whether the file is corrupted or uncorrupted. Someone might globally take on the risk, but I think we should still give them a way to know that this particular dataset is corrupt. So a header element is probably easiest and most visible.

This makes sense to me: people can then use it in loops, if they so desire, by checking against that attribute.
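For example, the batch workflow mentioned here might look like the following. The reader class and its argument and attribute names are stand-ins for whatever this PR finally settles on:

```python
# Hypothetical batch workflow using the corruption-flag attribute
# discussed above. FakeReader is a stand-in, not neo's actual API.
class FakeReader:
    def __init__(self, filename, ignore_integrity_checks=False):
        self.filename = filename
        # Pretend files containing "bad" in the name have a discontinuity.
        self.discontinuous_timestamps = "bad" in filename

good, corrupted = [], []
for path in ["session_a.rhd", "session_bad.rhd", "session_c.rhd"]:
    reader = FakeReader(path, ignore_integrity_checks=True)
    (corrupted if reader.discontinuous_timestamps else good).append(path)

print(good)       # ['session_a.rhd', 'session_c.rhd']
print(corrupted)  # ['session_bad.rhd']
```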

@h-mayorquin
Contributor Author

@zm711 I added your suggestions. I am not yet satisfied with the docstring writing and the variable name, but this will give an idea of the structure, and we can improve from there.

@zm711
Contributor

zm711 commented May 6, 2024

I think the structure looks good to me now, but I agree the naming probably needs a bit more thought. It's more a `load_data_unsafely`, because we aren't applying all checks, which does make the data "unsafe", but saying "unsafe data" doesn't sound quite right.

@h-mayorquin
Contributor Author

`load_data_unsafely`: I think this is a good idea.

`integrity_checks=True` by default is another option.

What do you think of the name of the attribute that keeps the state of the check?

@h-mayorquin
Contributor Author

Changed here to `load_data_unsafely`.

@zm711 zm711 left a comment

Just noticed a typo to fix. Otherwise I think this is fine for me. Unless @alejoe91 or @samuelgarcia are against this? But this works for me if it helps you.

```python
BaseRawIO.__init__(self)
self.filename = filename
self.load_data_unsafely = load_data_unsafely
self.unsafe_timestamps = False
```
Contributor

That works for me.

Contributor Author

I just noticed the default for `load_data_unsafely` is wrong.

Contributor Author

Corrected in the last commit.

Contributor

Actually, just thinking about this: maybe, rather than `unsafe_timestamps`, we should be more explicit and say something like `noncontinuous_timestamps`?

Contributor Author

Yes. Agree. Changing.

Contributor Author

Done.

@h-mayorquin
Contributor Author

What do you think of `ignore_integrity_checks` vs `load_data_unsafely`?

@zm711
Contributor

zm711 commented May 29, 2024

I think that might be a better name, although technically we are only doing one check, so it would be `ignore_integrity_check`. I'm not the biggest fan of saying unsafe or unsafely in general, because it is ambiguous. But `ignore_integrity_check`, or, if we want to be more explicit, `ignore_timestamp_integrity_check`.

Looks like CI failed for some reason, since both RTD and testing didn't work. Probably just needs a fresh push.

@h-mayorquin
Contributor Author

> I think that might be a better name, although technically we are only doing one check, so it would be `ignore_integrity_check`. I'm not the biggest fan of saying unsafe or unsafely in general, because it is ambiguous. But `ignore_integrity_check`, or, if we want to be more explicit, `ignore_timestamp_integrity_check`.
>
> Looks like CI failed for some reason, since both RTD and testing didn't work. Probably just needs a fresh push.

I am leaving open the possibility that we might have more than one check in the future, thinking that we might also want something similar in other extractors (for a consistent API), and they might have different integrity checks.

So: integrity checks as a generic name, and then a flag for each type of check (here, the discontinuous timestamps).

@h-mayorquin h-mayorquin force-pushed the add_possiblity_of_reading_file branch from 40e6b6e to 8fc1955 on May 30, 2024 08:37
```
ignore_integrity_checks: bool, default: False
    If True, data that violates integrity assumptions will be loaded. At the moment the only integrity
    check we perform is that timestamps are continuous. Setting this to True will ignore this check and set
    the attribute `discontinous_timestamps` to True if the timestamps are not continous. This attribute can be checked
```
Contributor

Suggested change

```diff
- the attribute `discontinous_timestamps` to True if the timestamps are not continous. This attribute can be checked
+ the attribute `discontinuous_timestamps` to True if the timestamps are not continuous. This attribute can be checked
```
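For concreteness, the continuity check described in the docstring above can be sketched in a few lines. This is an illustration only, not neo's actual implementation:

```python
def timestamps_are_continuous(timestamps):
    """Check the integrity assumption named in the docstring: consecutive
    Intan sample indices should increase by exactly one.

    A sketch of the idea, not neo's actual implementation.
    """
    return all(b - a == 1 for a, b in zip(timestamps, timestamps[1:]))

print(timestamps_are_continuous([0, 1, 2, 3]))  # True
print(timestamps_are_continuous([0, 1, 5, 6]))  # False: a dropped block
```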

@zm711 zm711 left a comment

Thank you!
