-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subdiff v2 176856023 #253
Subdiff v2 176856023 #253
Conversation
len(column_subtotal.subtrahend_idxs) > 0 | ||
and len(row_subtotal.subtrahend_idxs) > 0 | ||
): | ||
return np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work because unweighted counts are currently stored as integers which don't have a nan
. Fix should probably be to make unweighted counts as floats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next commit is a way to fix, but I'm not sure it's the correct way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a good question. There are two approaches I can think of:
-
Adopt the rule that all numeric measure values (but not idxs like
.inserted_column_idxs
) are float, and remove that broadly docstring-documented and otherwise implemented distinction between int and float values. I kind of think this is the right approach, but it would entail a lot of small changes, including right down to the cube-measure level where we would "cast" the ints we get for unweighted counts to floats. -
Just remove the
dtype
setting here in subtotals and let numpy determine what type to make things and some are ints and some are floats and we don't pay too much attention. This is easy, but it feels sloppy and the kind of thing that would come to bite us later. As a general principle, we'd like to have a consistent interface, which includes consistent value types.
Let me know what you think and maybe we can discuss at stand-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I made the changes discussed (_UnweightedCountMeasure._flat_values
now converts to ndarrays of floats). I also took out the _dtype
methods in Subtotals in matrix.insertions.py
because it didn't seem useful now that everything should be a float.
I have not updated docstrings. I'm a little worried about making that many changes while you, Ernesto and I all need to rebase on each other. Do you think I need to do that now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we probably should I suppose. My branch is on master now, so the bulk is already handled, and the bits that Ernesto is doing is primarily adds rather than changes, so I don't think the merge conflicts are going to be too big for him. I'm pretty sure yours will go first, because he's got a lot of changes to do on his branch and I expect it will be Monday or so before he's ready to push.
8a645df
to
2a148a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, here's a first pass to get you started. Overall I think it looks great, just a few finer points on the tests mostly. I ran out of time before getting to most of the unit tests, but I can spend more time on those on the next pass.
a9e346e
to
371c42f
Compare
4dd7d38
to
c2e0369
Compare
be4739b
to
a7fb849
Compare
9f7024e
to
f9e0b2a
Compare
if type(x) is dict: | ||
return np.nan | ||
elif type(x) is list: | ||
return [dict_to_nan_recursive(y) for y in x] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scanny - is this the right way to do this?
The problem I hit is that because of numeric arrays, this can have two different forms:
Mean of non-array: list of floats or dictionaries (indicating missing values)
Mean of array: list of lists of floats or dictionaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say those are distinct enough to warrant two different subclasses, where the factory chooses between _MeanMeasure
and perhaps something like _NumArrayMeanMeasure
. Then each ._flat_values
can do its own thing and we avoid using if
statements as a substitute for polymorphism.
Btw, The docstring says "Return tuple ..." and the return value here is ndarray
(which I think is preferable).
This issue came up in my review of Ernesto's branch here: #251 (comment) where I explained that I think all ._flat_values
return type should be ndarray
to avoid the unnecessary time and memory required to construct an intermediate tuple
object that just gets immediately converted to an ndarray
anyway.
So maybe a conflict brewing here, but that happens and Ernesto can figure it out during rebase (although it might be worth a mention to him in Slack). In any case, I think the right solution is to create the _NumArrayMeanMeasure
or whatever name is appropriate and then adjust the factory in _Measures
or wherever it is to choose the right subclass when constructing the measure. Also that these return ndarray
and preferably that all of the four subtypes return ndarray
and the _BaseMeasure.raw_cube_array
property is modified to avoid constructing an array out of an array (which it looks like you already did).
Also the docstring for _WeightedCountMeasure._flat_values
also says tuple
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say those are distinct enough to warrant two different subclasses, where the factory chooses between _MeanMeasure and perhaps something like _NumArrayMeanMeasure. Then each ._flat_values can do its own thing and we avoid using if statements as a substitute for polymorphism.
Just saw your new set of comments, but the problem I'm hitting as I thought about this is that I don't see a good way to distinguish numeric arrays from not at the time of the factory. Did you have thoughts on that?
My first idea was to use use _BaseMeasure._shape
but the shape of a CAT(N categories) X Numeric is the same as a Numeric Array (N subvariables).
The other areas where the code makes decisions about numeric arrays (eg _numeric_array_dimension
) are in the Cube
class, but I don't see a good way to pass that information along when creating the measures.
src/cr/cube/cube.py
Outdated
@@ -798,9 +810,6 @@ def _flat_values(self): | |||
if valid_counts: | |||
return np.array(valid_counts, dtype=np.float64) | |||
|
|||
if result["counts"] == [None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not sure about this (I had previously introduced this line to avoid having to change the behavior of cube_part::Nub::is_empty
- see change below).
This felt cleaner to me because it makes the return type consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, a list containing the single value None
seems odd. Do we have JSON that looks like "counts": [null]
or something? That would be ugly indeed, as we could choose either null
or an empty array and surely one of those would suit better.
Anyway, I'll see what I think after looking at the Nub bit, otherwise I'd say good riddance to this line :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gergness a few comments I wanted to send your way before breaking for lunch, I'll continue when I get back.
src/cr/cube/cube.py
Outdated
raw_cube_array = np.array(self._flat_values).flatten().reshape(self._shape) | ||
raw_cube_array = self._flat_values.flatten().reshape(self._shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right on. Great minds think alike I suppose :)
"""Return tuple of mean values as found in cube response. | ||
"""Return ndarray of np.float64 values as found in cube response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's my habit, and perhaps more proper, to leave out the "Return " suffix on docstrings of properties (lazy or otherwise) since they behave like attributes. So we document a property like we would an attribute, by stating what it is rather that what it "returns", since returning a value is a characteristic of a function and attributes just "are" their value.
As a side benefit, it opens up first-docstring-line real estate for more descriptive words.
src/cr/cube/cube.py
Outdated
"""np.ndarray of np.float64 counts before weighting or None, if unavailable. | ||
|
||
Use floats to avoid int overflow bugs and so we can use nan. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One convention I've adopted, which is consistent with Python 3 type hints, is to prefix are return value that can also be None as "Optional ". So this would become "Optional 1D np.float64 ndarray of counts before weighting."
Then in the body of the docstring you can mention the circumstances under which it is None, like "This value is None
when the cube-result does not contain a cube-counts measure."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, but this was mislabelled because it never returns None, right?
src/cr/cube/cube.py
Outdated
if valid_counts: | ||
return np.array(valid_counts, dtype=np.float64) | ||
|
||
return np.array(result["counts"], dtype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a very fine point, but just in case you're interested, I would generally phrase this as:
return (
np.array(valid_counts, dtype=np.float64)
if valid counts
else np.array(result["counts"], dtype=np.float64)
)
I'm not entirely sure why I like this better, except that as a reader you don't have to reason about two separate statements and maybe any unhandled "fall-through" cases or something; like it's a straight-line execution path and says "okay, we're returning a value now, there's no two ways about it, it's just a question of what that value is going to be.
Anyway do it the way you like best, just wanted to share my perspective on that.
src/cr/cube/cube.py
Outdated
return tuple(self._cube_dict["result"]["measures"]["count"]["data"]) | ||
return np.array( | ||
self._cube_dict["result"]["measures"]["count"]["data"], dtype=np.float64 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring still says tuple
but this is 1D np.float64 ndarray
now. But more importantly, I think this value needs to be optional now, like it is None
if there is no measures.count
field in the JSON.
Hmm, I suppose that's handled by the _Measures.is_weighted
property that checks for this in the payload ... but that's awfully far away. Hmm, the broader structure of responsibilities is a little borked up it looks like, with like _Measures.means
also checking the payload first.
I'd be inclined to think that ._flat_values
should be optional for all measures that are sometimes not there, which I believe is all measures. Then if we want to know whether it has a measure we check whether that measure is None and we leave the payload interpretation to the actual measure class that can fully encapsulate it.
Not sure if you feel up to that refactoring just now, but I'd say at least that this ._flat_values
should be optional and maybe fix the spot in _Measures.is_weighted
that inspects the payload for itself to rely on a value from _WeightedCountMeasure
rather than poke around in the JSON. I suppose that means that _BaseMeasure.raw_cube_array
would need to be optional too.
Hmm, this might require a little more noodling, and I don't want to launch you into a quagmire, but maybe you can have a think on it and we can spend a cafe minute on it or something. Either way, it can't hurt to start by making this optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, after reflection and nourishment, here's what I think:
-
All
._flat_values
are typeOptional 1D float64 ndarray
(maybe numeric array is 2D or something, I don't understand that enough to say, but ndarray anyway). The value isNone
when there is no such cube-measure in the payload or the values it needs are otherwise not to be found. This way, only the measure objects need to know where to go in the payload to get their values and all other measure-related payload inspection can be eliminated. -
_BaseMeasure.raw_cube_array
is also optional. Since this is the main interface method of a measure object (and._flat_values
is private), it beingNone
is how any higher-level unit figures out whether that base measure exists in the payload. -
If we still need
_Measures.weighted_counts
to automatically substitute_UnweightedCountMeasure
when the cube is unweighted, then we give_Measures
a private._weighted_counts
property which is either a_WeightedCountMeasure
orNone
and let the interface property.weighted_counts
perform the test and substitution, likereturn self._weighted_counts if self._weighted_counts is not None else self.unweighted_counts
.
Overall, it still smells a little fishy, because I think that auto-substitution thing probably has to go, but this would at least remove part of the long-distance coupling and leakage of responsibilities.
Probably even better would be to add a .load()
classmethod to _BaseMeasure
and that is called for construction purposes rather than the subclass itself. That way it can return None
if the measure cannot be created and callers don't have to test .raw_cube_array
to see if their reference is valid or just an empty shell.
Let me know if this sounds like more than you bargained for and I can add a few-commit branch after yours merges. But it's not untypical that modules need some cleaning up before adding new features, so it wouldn't be a bad exercise to get this one tidied up a bit. This module mostly goes back to 2019 so it's a little bit crusty around the edges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I got this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is all the time I have for this today. I'll see if I can make it through the tests tomorrow.
src/cr/cube/cube.py
Outdated
return tuple(self._cube_dict["result"]["measures"]["count"]["data"]) | ||
return np.array( | ||
self._cube_dict["result"]["measures"]["count"]["data"], dtype=np.float64 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, after reflection and nourishment, here's what I think:
-
All
._flat_values
are typeOptional 1D float64 ndarray
(maybe numeric array is 2D or something, I don't understand that enough to say, but ndarray anyway). The value isNone
when there is no such cube-measure in the payload or the values it needs are otherwise not to be found. This way, only the measure objects need to know where to go in the payload to get their values and all other measure-related payload inspection can be eliminated. -
_BaseMeasure.raw_cube_array
is also optional. Since this is the main interface method of a measure object (and._flat_values
is private), it beingNone
is how any higher-level unit figures out whether that base measure exists in the payload. -
If we still need
_Measures.weighted_counts
to automatically substitute_UnweightedCountMeasure
when the cube is unweighted, then we give_Measures
a private._weighted_counts
property which is either a_WeightedCountMeasure
orNone
and let the interface property.weighted_counts
perform the test and substitution, likereturn self._weighted_counts if self._weighted_counts is not None else self.unweighted_counts
.
Overall, it still smells a little fishy, because I think that auto-substitution thing probably has to go, but this would at least remove part of the long-distance coupling and leakage of responsibilities.
Probably even better would be to add a .load()
classmethod to _BaseMeasure
and that is called for construction purposes rather than the subclass itself. That way it can return None
if the measure cannot be created and callers don't have to test .raw_cube_array
to see if their reference is valid or just an empty shell.
Let me know if this sounds like more than you bargained for and I can add a few-commit branch after yours merges. But it's not untypical that modules need some cleaning up before adding new features, so it wouldn't be a bad exercise to get this one tidied up a bit. This module mostly goes back to 2019 so it's a little bit crusty around the edges.
6c395e3
to
dbc4e70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, LGTM, very nice job :)
Adds the subtotal "difference" insertions where some categories can be subtracted off. Builds off of Steve's new-architecture-strand branch.