Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add I/O for masked Traces #49

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

malcolmw
Copy link

This pull request is for a branch that I use to work with masked Traces. Masked values are filled with a fill value automatically determined based on the data array's dtype at write-time, which is stored in the Dataset's attributes and later used to reconstruct the mask at read-time.

Malcolm White added 2 commits May 19, 2018 13:26
Make backwards compatible with older ASDF versions


I/O of masked arrays of float/integer data.

Ignore some test files.
Make backwards compatible with older ASDF versions


I/O of masked arrays of float/integer data.

Ignore some test files.
Fix broken identity test
@coveralls
Copy link

coveralls commented May 19, 2018

Coverage Status

Coverage increased (+0.1%) to 89.162% when pulling c6db4d2 on malcolmw:dev/masked_trace_IO into fb88b35 on SeismicData:master.

Copy link
Member

@krischer krischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heyhey,

first of all I'm really sorry for taking so long!

I like this and the implementation looks solid to me.

There is one point I'd like to discuss a bit: Instead of relying on a "magic" mask value, how about storing the actual mask as an attribute?

With some trickery it could be made so it takes 1 bit per array item. It would be a bit slow but as we are talking about masked arrays I'm not too worried here.

Aside from that the currently chosen mask value is machine dependent - probably not relevant for the platforms we really care about but its a bit of a sore spot.

For unsigned integer arrays the currently chosen mask value would be zero which does not really work. Also for lowish precision integers the risk for false positives might be a bit too large - for floating points I agree that this would not be an issue.

All in all I think the that using an actual mask array would solve these issues at the expense of storage size and performance.

@krischer
Copy link
Member

Hi @malcolmw

Is there still an interest in following up this?

@malcolmw
Copy link
Author

Hi, @krischer,

This is not a priority for me anymore, so feel free to close this PR if it isn't a significant value-adding feature for pyASDF. However, if you think it is a useful feature you would like to merge to facilitate work with long segments of potentially-gappy continuous data, I am happy to help out.

@krischer
Copy link
Member

I do actually think that this would be a very nice addition to the data format. As pointed out in a comment above I'd prefer the data model of actually carrying along a second mask data set (could be name the same as the actual dataset, just prefixed with __MASK__ or so).

This would mirror to a certain extend how numpy's masked arrays work and it could also be properly integrated into the format. I could take care of adding it to the format definition and the validator if there is still interest in implementing this.

@malcolmw
Copy link
Author

Sounds good. I'm happy to implement, though it will be a while before I get to this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants