Dask pp #2318
Conversation
Hi @pp-mo, please could you target the dask feature branch with such changes in the short term, rather than master? Thank you.

Sorry, forgot about that!

WIP
Force-pushed 7a4e5d1 to 524775c
Selected limited testing now working! Hoping you like it, as I'm wanting to build on this testing strategy in further work...
DPeterK
left a comment
@pp-mo like this! Just a few thoughts here from me.
if isinstance(other_attr, biggus.NumpyArrayAdapter):
    other_attr = other_attr.concrete
self_attr = as_concrete_data(getattr(self, attr))
other_attr = as_concrete_data(getattr(other, attr))
I'm intrigued that you have to make each attr concrete to compare it - can dask do lazy object comparisons? Is such a thing even possible?
It certainly is possible: (self_attr == other_attr) would be a lazy object, i.e. it hasn't looked at the data yet
(and you can further process that, index it, etc., all still being lazy).
So, I think we really want this code to be explicit that it is realising the content here.
In fact, I fully expected that you would always be required to call 'compute', and that nothing else would do it (as with biggus);
however it seems that applying np.all() will realise it anyway (and so will do the compare).
In my view that is not actually nice, and may even be a bug, as it's not documented anywhere -- see comment on #2308.
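The behaviour described above is presumably down to dask arrays implementing NumPy's `__array__` protocol: np.all() coerces its argument to an ndarray, which triggers computation as a side effect. A minimal sketch of that mechanism, using a hypothetical stand-in class rather than dask itself:

```python
import numpy as np

class Lazy:
    """A stand-in for a lazy array: realising it is a visible side effect."""
    def __init__(self, data):
        self._data = np.asarray(data)
        self.realised = False

    def __array__(self, dtype=None, copy=None):
        # np.all() / np.asarray() call this implicitly.
        self.realised = True
        return self._data if dtype is None else self._data.astype(dtype)

lazy = Lazy([True, True, True])
assert not lazy.realised
result = np.all(lazy)   # implicitly realises the data via __array__
assert lazy.realised
assert bool(result) is True
```

This is why np.all "works" on a lazy object without an explicit compute() call: the coercion to ndarray happens silently inside NumPy.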
@dkillick Yup, I can confirm that dask maintains the laziness when comparing lazy objects; it just generates another lazy dask graph, which you need to realise to get the answer. So the following gives a lazy result:

result = getattr(self, attr) == getattr(other, attr)

And the answer is realised with, for example:

>>> result.compute()
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

Or indeed, as @pp-mo suggests, the lazy dask result is made concrete by np.all:

>>> np.all(result)
True

... which is all good to know, and pretty darn cool!
# You should have received a copy of the GNU Lesser General Public License
# along with Iris.  If not, see <http://www.gnu.org/licenses/>.
"""
Test lazy data handlingin iris.fileformats.pp.
Typo: "handlingin" --> "handling in".
Will fix!
@@ -1,4 +1,4 @@
# (C) British Crown Copyright 2010 - 2016, Met Office
# (C) British Crown Copyright 2010 - 2017, Met Office
What happened here?
I made some changes, then didn't need them and reverted them all in a subsequent commit.
But then you can't revert the date without the licence check failing.
I think you can remove this when you squash (if you remember!).
cube = self.cube
raw_data = cube._my_data
lazy = cube.lazy_data()
self.assertIs(cube.lazy_data(), raw_data)
Did you want to check the type of lazy here too? I don't think you've checked that cube.lazy_data returns a dask object.
If you do choose to do this then this test method will get quite long, with lots of assertions, so it would be worth looking into splitting it up a bit.
I didn't want to be explicit about the type, having hidden those details behind the iris._lazy_data interface.
It also doesn't seem right to use is_lazy_data when we are also testing that separately in another class.
I'll fix it anyway: I have done that in test_load above.
self.assertIsNot(cube.lazy_data(), raw_data)
self.assertArrayAllClose(lazy.compute(), raw_data.compute())

def test_lazy_data__set(self):
To me this name implies you're going to run a set() operation on the lazy data.
Ok, I'll rename these functions.
I thought "test_lazy_data__new_content", but it's getting lengthy for the subsequent ones.
How about...
test_lazy_data__newdata
test_lazy_data__fail_newdata_bad_shape
test_lazy_data__fail_newdata_not_lazy
New thought: TODO: we can usefully split the class into tests for ".data", ".has_lazy_data" and ".lazy_data()".
That would add context, so you can then simplify the testcase names.
# A magic value, borrowed from biggus
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2
@pp-mo Difficult to know what this should default to when it's unknown (at this point) what type of operation will be performed, as the choice of chunking should really be aligned with the expected operation/use in order to be optimal (from what I understand).
Well, this is obviously preliminary.
I propose to "just not worry" about this for now!
Agreed
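For context, a fixed byte/element budget like _MAX_CHUNK_SIZE can be turned into a chunk shape by limiting the leading dimension. A rough sketch of that idea (the guess_chunks helper is illustrative only, not the actual Iris or biggus code):

```python
# Illustrative only: derive a chunk shape from a maximum chunk size,
# expressed here as a count of elements, by limiting the first dimension.
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2

def guess_chunks(shape):
    """Limit the first dimension so a chunk holds at most _MAX_CHUNK_SIZE elements."""
    # Number of elements in one "row" (the product of all trailing dimensions).
    row_size = 1
    for dim in shape[1:]:
        row_size *= dim
    # How many leading rows fit in the budget (always at least one).
    nrows = max(1, _MAX_CHUNK_SIZE // row_size)
    return (min(shape[0], nrows),) + tuple(shape[1:])

print(guess_chunks((1000, 400, 500)))   # -> (83, 400, 500)
print(guess_chunks((10, 4)))            # -> (10, 4): already under budget
```

A shape-based heuristic like this is necessarily a compromise, which is exactly the concern raised above: without knowing the access pattern, no single default chunking is optimal.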
A :class:`biggus.Array` representing the multi-dimensional
data of the Cube.
"""
@pp-mo Should we care about updating the doc-string at this point ?
Ok, I might as well...
# Unmask the array only if it is filled.
if isinstance(data, np.ndarray) and ma.count_masked(data) == 0:
if (isinstance(data, np.ma.masked_array) and
        ma.count_masked(data) == 0):
@pp-mo It's not possible to put a masked array into a dask.array.Array (well, certainly one that does have its mask set) ... so is this really still valid in the new dask world? Or am I missing something here ...
Are you imagining that a user has a non-masked masked array wrapped up in a dask.array.Array ?
At the moment I'm just totally ignoring the masked issue.
This code is effectively a dead branch with the latest changes, since as_concrete_data will never return a masked result.
But this will need fixing later, so I left it in.
As discussed at length 😄
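The check under discussion — dropping the mask only when no values are actually masked — can be sketched in plain NumPy. This is a minimal illustration of the condition, with a hypothetical helper name, not the Iris code itself:

```python
import numpy as np
import numpy.ma as ma

def unmask_if_filled(data):
    """Return a plain ndarray when a masked array has no masked points."""
    if isinstance(data, ma.MaskedArray) and ma.count_masked(data) == 0:
        # No points are actually masked, so the mask can safely be dropped.
        return np.asarray(data.data)
    return data

filled = ma.masked_array([1, 2, 3], mask=[False, False, False])
still_masked = ma.masked_array([1, 2, 3], mask=[True, False, False])

assert type(unmask_if_filled(filled)) is np.ndarray
assert isinstance(unmask_if_filled(still_masked), ma.MaskedArray)
```

Note that because MaskedArray subclasses ndarray, the original `isinstance(data, np.ndarray)` test was true for masked arrays too, which is presumably why the diff narrows it to `np.ma.masked_array`.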
from iris._lazy_data import is_lazy_data, as_lazy_data, as_concrete_data

class MixinLazyTests(object):
@pp-mo MixinLazyTestData really ... kinda, sorta, rather than tests.
ok
class Test_as_lazy_data(MixinLazyTests, tests.IrisTest):
    def test_lazy(self):
@pp-mo This is really a test_lazy_pass_thru ...
renamed now (coming soon)...
Latest hopefully addresses the outstanding review comments?

Ok, late to the party here, but I am fully on board with this implementation. Given its target is the 'dask' feature branch, I am highly minded to merge this and work from here, e.g. in #2319. Any objections to a merge? I'll aim to merge GMT am tomorrow otherwise.
Note "temporary" testing approach: just selected new tests, for the new code functions:

python -m unittest discover -v lib/iris/tests/integration/temp_dask

Addresses #2307