Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subdiff v2 176856023 #253

Merged
merged 30 commits into from
Mar 5, 2021
Merged

Subdiff v2 176856023 #253

merged 30 commits into from
Mar 5, 2021

Conversation

gergness
Copy link
Contributor

Adds the subtotal "difference" insertions where some categories can be subtracted off. Builds off of Steve's new-architecture-strand branch.

len(column_subtotal.subtrahend_idxs) > 0
and len(row_subtotal.subtrahend_idxs) > 0
):
return np.nan
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work because unweighted counts are currently stored as integers which don't have a nan. Fix should probably be to make unweighted counts as floats

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next commit is a way to fix, but I'm not sure it's the correct way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a good question. There are two approaches I can think of:

  1. Adopt the rule that all numeric measure values (but not idxs like .inserted_column_idxs) are float, and remove that broadly docstring-documented and otherwise implemented distinction between int and float values. I kind of think this is the right approach, but it would entail a lot of small changes, including right down to the cube-measure level where we would "cast" the ints we get for unweighted counts to floats.

  2. Just remove the dtype setting here in subtotals and let numpy determine what type to make things and some are ints and some are floats and we don't pay too much attention. This is easy, but it feels sloppy and the kind of thing that would come to bite us later. As a general principle, we'd like to have a consistent interface, which includes consistent value types.

Let me know what you think and maybe we can discuss at stand-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I made the changes discussed (_UnweightedCountMeasure._flat_values now converts to ndarrays of floats). I also took out the _dtype methods in Subtotals in matrix.insertions.py because it didn't seem useful now that everything should be a float.

I have not updated docstrings. I'm a little worried about making that many changes while you, Ernesto and I all need to rebase on each other. Do you think I need to do that now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we probably should I suppose. My branch is on master now, so the bulk is already handled, and the bits that Ernesto is doing is primarily adds rather than changes, so I don't think the merge conflicts are going to be too big for him. I'm pretty sure yours will go first, because he's got a lot of changes to do on his branch and I expect it will be Monday or so before he's ready to push.

@gergness gergness force-pushed the subdiff-v2-176856023 branch 2 times, most recently from 8a645df to 2a148a8 Compare February 23, 2021 19:24
@gergness gergness requested a review from scanny February 23, 2021 19:30
Copy link
Contributor

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here's a first pass to get you started. Overall I think it looks great, just a few finer points on the tests mostly. I ran out of time before getting to most of the unit tests, but I can spend more time on those on the next pass.

src/cr/cube/cubepart.py Outdated Show resolved Hide resolved
src/cr/cube/cubepart.py Outdated Show resolved Hide resolved
src/cr/cube/cubepart.py Outdated Show resolved Hide resolved
tests/fixtures/ca-subdiff.json Show resolved Hide resolved
tests/unit/matrix/test_measure.py Outdated Show resolved Hide resolved
tests/fixtures/cat-subdiff-x-cat-subdiff.json Outdated Show resolved Hide resolved
tests/fixtures/cat-subdiff-x-mr.json Outdated Show resolved Hide resolved
tests/fixtures/cat-subdiff.json Outdated Show resolved Hide resolved
tests/integration/test_headers_and_subtotals.py Outdated Show resolved Hide resolved
tests/integration/test_headers_and_subtotals.py Outdated Show resolved Hide resolved
@scanny scanny force-pushed the new-architecture-strand-176841194 branch from a9e346e to 371c42f Compare March 2, 2021 00:14
@gergness gergness changed the base branch from new-architecture-strand-176841194 to master March 2, 2021 21:17
@coveralls
Copy link

coveralls commented Mar 3, 2021

Coverage Status

Coverage decreased (-17.7%) to 82.348% when pulling e43aba4 on subdiff-v2-176856023 into dd04970 on master.

if type(x) is dict:
return np.nan
elif type(x) is list:
return [dict_to_nan_recursive(y) for y in x]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scanny - is this the right way to do this?

The problem I hit is that because of numeric arrays, this can have two different forms:

Mean of non-array: list of floats or dictionaries (indicating missing values)
Mean of array: list of lists of floats or dictionaries

Copy link
Contributor

@scanny scanny Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say those are distinct enough to warrant two different subclasses, where the factory chooses between _MeanMeasure and perhaps something like _NumArrayMeanMeasure. Then each ._flat_values can do its own thing and we avoid using if statements as a substitute for polymorphism.

Btw, The docstring says "Return tuple ..." and the return value here is ndarray (which I think is preferable).

This issue came up in my review of Ernesto's branch here: #251 (comment) where I explained that I think all ._flat_values return type should be ndarray to avoid the unnecessary time and memory required to construct an intermediate tuple object that just gets immediately converted to an ndarray anyway.

So maybe a conflict brewing here, but that happens and Ernesto can figure it out during rebase (although it might be worth a mention to him in Slack). In any case, I think the right solution is to create the _NumArrayMeanMeasure or whatever name is appropriate and then adjust the factory in _Measures or wherever it is to choose the right subclass when constructing the measure. Also that these return ndarray and preferably that all of the four subtypes return ndarray and the _BaseMeasure.raw_cube_array property is modified to avoid constructing an array out of an array (which it looks like you already did).

Also the docstring for _WeightedCountMeasure._flat_values also says tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say those are distinct enough to warrant two different subclasses, where the factory chooses between _MeanMeasure and perhaps something like _NumArrayMeanMeasure. Then each ._flat_values can do its own thing and we avoid using if statements as a substitute for polymorphism.

Just saw your new set of comments, but the problem I'm hitting as I thought about this is that I don't see a good way to distinguish numeric arrays from not at the time of the factory. Did you have thoughts on that?

My first idea was to use use _BaseMeasure._shape but the shape of a CAT(N categories) X Numeric is the same as a Numeric Array (N subvariables).

The other areas where the code makes decisions about numeric arrays (eg _numeric_array_dimension) are in the Cube class, but I don't see a good way to pass that information along when creating the measures.

@@ -798,9 +810,6 @@ def _flat_values(self):
if valid_counts:
return np.array(valid_counts, dtype=np.float64)

if result["counts"] == [None]:
Copy link
Contributor Author

@gergness gergness Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure about this (I had previously introduced this line to avoid having to change the behavior of cube_part::Nub::is_empty - see change below).

This felt cleaner to me because it makes the return type consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, a list containing the single value None seems odd. Do we have JSON that looks like "counts": [null] or something? That would be ugly indeed, as we could choose either null or an empty array and surely one of those would suit better.

Anyway, I'll see what I think after looking at the Nub bit, otherwise I'd say good riddance to this line :)

@gergness gergness requested a review from scanny March 3, 2021 17:18
Copy link
Contributor

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gergness a few comments I wanted to send your way before breaking for lunch, I'll continue when I get back.

raw_cube_array = np.array(self._flat_values).flatten().reshape(self._shape)
raw_cube_array = self._flat_values.flatten().reshape(self._shape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right on. Great minds think alike I suppose :)

"""Return tuple of mean values as found in cube response.
"""Return ndarray of np.float64 values as found in cube response.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my habit, and perhaps more proper, to leave out the "Return " suffix on docstrings of properties (lazy or otherwise) since they behave like attributes. So we document a property like we would an attribute, by stating what it is rather that what it "returns", since returning a value is a characteristic of a function and attributes just "are" their value.

As a side benefit, it opens up first-docstring-line real estate for more descriptive words.

Comment on lines 799 to 802
"""np.ndarray of np.float64 counts before weighting or None, if unavailable.

Use floats to avoid int overflow bugs and so we can use nan.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One convention I've adopted, which is consistent with Python 3 type hints, is to prefix are return value that can also be None as "Optional ". So this would become "Optional 1D np.float64 ndarray of counts before weighting."

Then in the body of the docstring you can mention the circumstances under which it is None, like "This value is None when the cube-result does not contain a cube-counts measure."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but this was mislabelled because it never returns None, right?

Comment on lines 810 to 813
if valid_counts:
return np.array(valid_counts, dtype=np.float64)

return np.array(result["counts"], dtype=np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a very fine point, but just in case you're interested, I would generally phrase this as:

return (
    np.array(valid_counts, dtype=np.float64)
    if valid counts
    else np.array(result["counts"], dtype=np.float64)
)

I'm not entirely sure why I like this better, except that as a reader you don't have to reason about two separate statements and maybe any unhandled "fall-through" cases or something; like it's a straight-line execution path and says "okay, we're returning a value now, there's no two ways about it, it's just a question of what that value is going to be.

Anyway do it the way you like best, just wanted to share my perspective on that.

Comment on lines 809 to 824
return tuple(self._cube_dict["result"]["measures"]["count"]["data"])
return np.array(
self._cube_dict["result"]["measures"]["count"]["data"], dtype=np.float64
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring still says tuple but this is 1D np.float64 ndarray now. But more importantly, I think this value needs to be optional now, like it is None if there is no measures.count field in the JSON.

Hmm, I suppose that's handled by the _Measures.is_weighted property that checks for this in the payload ... but that's awfully far away. Hmm, the broader structure of responsibilities is a little borked up it looks like, with like _Measures.means also checking the payload first.

I'd be inclined to think that ._flat_values should be optional for all measures that are sometimes not there, which I believe is all measures. Then if we want to know whether it has a measure we check whether that measure is None and we leave the payload interpretation to the actual measure class that can fully encapsulate it.

Not sure if you feel up to that refactoring just now, but I'd say at least that this ._flat_values should be optional and maybe fix the spot in _Measures.is_weighted that inspects the payload for itself to rely on a value from _WeightedCountMeasure rather than poke around in the JSON. I suppose that means that _BaseMeasure.raw_cube_array would need to be optional too.

Hmm, this might require a little more noodling, and I don't want to launch you into a quagmire, but maybe you can have a think on it and we can spend a cafe minute on it or something. Either way, it can't hurt to start by making this optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, after reflection and nourishment, here's what I think:

  1. All ._flat_values are type Optional 1D float64 ndarray (maybe numeric array is 2D or something, I don't understand that enough to say, but ndarray anyway). The value is None when there is no such cube-measure in the payload or the values it needs are otherwise not to be found. This way, only the measure objects need to know where to go in the payload to get their values and all other measure-related payload inspection can be eliminated.

  2. _BaseMeasure.raw_cube_array is also optional. Since this is the main interface method of a measure object (and ._flat_values is private), it being None is how any higher-level unit figures out whether that base measure exists in the payload.

  3. If we still need _Measures.weighted_counts to automatically substitute _UnweightedCountMeasure when the cube is unweighted, then we give _Measures a private ._weighted_counts property which is either a _WeightedCountMeasure or None and let the interface property .weighted_counts perform the test and substitution, like return self._weighted_counts if self._weighted_counts is not None else self.unweighted_counts.

Overall, it still smells a little fishy, because I think that auto-substitution thing probably has to go, but this would at least remove part of the long-distance coupling and leakage of responsibilities.

Probably even better would be to add a .load() classmethod to _BaseMeasure and that is called for construction purposes rather than the subclass itself. That way it can return None if the measure cannot be created and callers don't have to test .raw_cube_array to see if their reference is valid or just an empty shell.

Let me know if this sounds like more than you bargained for and I can add a few-commit branch after yours merges. But it's not untypical that modules need some cleaning up before adding new features, so it wouldn't be a bad exercise to get this one tidied up a bit. This module mostly goes back to 2019 so it's a little bit crusty around the edges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I got this now

Copy link
Contributor

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is all the time I have for this today. I'll see if I can make it through the tests tomorrow.

Comment on lines 809 to 824
return tuple(self._cube_dict["result"]["measures"]["count"]["data"])
return np.array(
self._cube_dict["result"]["measures"]["count"]["data"], dtype=np.float64
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, after reflection and nourishment, here's what I think:

  1. All ._flat_values are type Optional 1D float64 ndarray (maybe numeric array is 2D or something, I don't understand that enough to say, but ndarray anyway). The value is None when there is no such cube-measure in the payload or the values it needs are otherwise not to be found. This way, only the measure objects need to know where to go in the payload to get their values and all other measure-related payload inspection can be eliminated.

  2. _BaseMeasure.raw_cube_array is also optional. Since this is the main interface method of a measure object (and ._flat_values is private), it being None is how any higher-level unit figures out whether that base measure exists in the payload.

  3. If we still need _Measures.weighted_counts to automatically substitute _UnweightedCountMeasure when the cube is unweighted, then we give _Measures a private ._weighted_counts property which is either a _WeightedCountMeasure or None and let the interface property .weighted_counts perform the test and substitution, like return self._weighted_counts if self._weighted_counts is not None else self.unweighted_counts.

Overall, it still smells a little fishy, because I think that auto-substitution thing probably has to go, but this would at least remove part of the long-distance coupling and leakage of responsibilities.

Probably even better would be to add a .load() classmethod to _BaseMeasure and that is called for construction purposes rather than the subclass itself. That way it can return None if the measure cannot be created and callers don't have to test .raw_cube_array to see if their reference is valid or just an empty shell.

Let me know if this sounds like more than you bargained for and I can add a few-commit branch after yours merges. But it's not untypical that modules need some cleaning up before adding new features, so it wouldn't be a bad exercise to get this one tidied up a bit. This module mostly goes back to 2019 so it's a little bit crusty around the edges.

src/cr/cube/cubepart.py Outdated Show resolved Hide resolved
src/cr/cube/cubepart.py Outdated Show resolved Hide resolved
src/cr/cube/cubepart.py Show resolved Hide resolved
src/cr/cube/dimension.py Outdated Show resolved Hide resolved
src/cr/cube/matrix/assembler.py Outdated Show resolved Hide resolved
src/cr/cube/matrix/assembler.py Outdated Show resolved Hide resolved
src/cr/cube/matrix/subtotals.py Outdated Show resolved Hide resolved
src/cr/cube/matrix/subtotals.py Outdated Show resolved Hide resolved
src/cr/cube/stripe/insertion.py Outdated Show resolved Hide resolved
@gergness gergness requested a review from scanny March 4, 2021 21:18
Copy link
Contributor

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, LGTM, very nice job :)

@gergness gergness merged commit 918e9fe into master Mar 5, 2021
@ernestoarbitrio ernestoarbitrio deleted the subdiff-v2-176856023 branch December 20, 2022 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants