Subdiff v2 176856023 #253

gergness · 2021-02-19T02:58:23Z

Adds the subtotal "difference" insertions where some categories can be subtracted off. Builds off of Steve's new-architecture-strand branch.

tests/integration/test_headers_and_subtotals.py

gergness · 2021-02-22T17:27:28Z

src/cr/cube/matrix/subtotals.py

+            len(column_subtotal.subtrahend_idxs) > 0
+            and len(row_subtotal.subtrahend_idxs) > 0
+        ):
+            return np.nan


This doesn't work because unweighted counts are currently stored as integers, which have no NaN value. The fix should probably be to store unweighted counts as floats.

The next commit is one way to fix it, but I'm not sure it's the correct way.
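To make the problem concrete, here is a minimal sketch in plain NumPy (not cr.cube code; the variable names are illustrative) showing that an integer array rejects `np.nan` while a float array accepts it:

```python
import numpy as np

# Integer arrays have no representation for NaN, so the assignment fails.
int_counts = np.array([10, 20, 30], dtype=np.int64)
try:
    int_counts[1] = np.nan
    int_assignment_failed = False
except ValueError:
    int_assignment_failed = True

# The same counts stored as floats accept NaN without trouble.
float_counts = int_counts.astype(np.float64)
float_counts[1] = np.nan
```

This is why returning `np.nan` for an undefined subtotal intersection only works once the underlying values are floats.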

Yeah, it's a good question. There are two approaches I can think of:

Adopt the rule that all numeric measure values (but not idxs like .inserted_column_idxs) are float, and remove that broadly docstring-documented and otherwise implemented distinction between int and float values. I kind of think this is the right approach, but it would entail a lot of small changes, including right down to the cube-measure level where we would "cast" the ints we get for unweighted counts to floats.

Just remove the dtype setting here in subtotals and let numpy determine the type, so some values are ints and some are floats and we don't pay too much attention. This is easy, but it feels sloppy and is the kind of thing that would come back to bite us later. As a general principle, we'd like to have a consistent interface, which includes consistent value types.

Let me know what you think and maybe we can discuss at stand-up.

Okay, I made the changes discussed (_UnweightedCountMeasure._flat_values now converts to ndarrays of floats). I also took out the _dtype methods in Subtotals in matrix.insertions.py because they didn't seem useful now that everything should be a float.
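A rough sketch of what "convert to float at the source" amounts to. The `payload` dict below is invented for illustration and is not the real cube-response layout:

```python
import numpy as np

# Hypothetical stand-in for the unweighted counts in a cube response.
payload = {"counts": [3, 0, 7, 12]}

# Letting NumPy infer the dtype yields integers for this input.
inferred = np.array(payload["counts"])

# Casting once, at the measure level, makes every downstream value float64.
flat_values = np.array(payload["counts"], dtype=np.float64)

# Downstream arithmetic (e.g. a subtotal) stays in float64.
subtotal = flat_values.sum()
```

With the cast done in one place, subtotal code no longer needs its own dtype logic.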

I have not updated docstrings. I'm a little worried about making that many changes while you, Ernesto and I all need to rebase on each other. Do you think I need to do that now?

Well, we probably should, I suppose. My branch is on master now, so the bulk is already handled, and the bits that Ernesto is doing are primarily additions rather than changes, so I don't think the merge conflicts are going to be too big for him. I'm pretty sure yours will go first, because he's got a lot of changes to do on his branch and I expect it will be Monday or so before he's ready to push.

scanny

Okay, here's a first pass to get you started. Overall I think it looks great, just a few finer points on the tests mostly. I ran out of time before getting to most of the unit tests, but I can spend more time on those on the next pass.

src/cr/cube/cubepart.py

tests/fixtures/ca-subdiff.json

tests/unit/matrix/test_measure.py

tests/fixtures/cat-subdiff-x-cat-subdiff.json

tests/fixtures/cat-subdiff-x-mr.json

tests/fixtures/cat-subdiff.json

tests/integration/test_headers_and_subtotals.py

coveralls · 2021-03-03T00:31:13Z

Coverage decreased (-17.7%) to 82.348% when pulling e43aba4 on subdiff-v2-176856023 into dd04970 on master.

gergness · 2021-03-03T17:14:09Z

src/cr/cube/cube.py

+            if type(x) is dict:
+                return np.nan
+            elif type(x) is list:
+                return [dict_to_nan_recursive(y) for y in x]


@scanny - is this the right way to do this?

The problem I hit is that because of numeric arrays, this can have two different forms:

Mean of non-array: list of floats or dictionaries (indicating missing values)
Mean of array: list of lists of floats or dictionaries
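For reference, a self-contained sketch of the recursive helper handling both forms. The sample data and the `{"?": -8}` marker shape are assumptions for illustration, and `isinstance` is used here in place of the `type(x) is ...` checks in the diff:

```python
import numpy as np


def dict_to_nan_recursive(x):
    """Replace dicts (missing-value markers) with NaN at any nesting depth."""
    if isinstance(x, dict):
        return np.nan
    if isinstance(x, list):
        return [dict_to_nan_recursive(y) for y in x]
    return x


# Illustrative data only; the marker dict layout is an assumption.
mean_of_non_array = [1.5, {"?": -8}, 3.0]
mean_of_array = [[1.5, {"?": -8}], [2.0, 4.5]]

flat = np.array(dict_to_nan_recursive(mean_of_non_array), dtype=np.float64)
nested = np.array(dict_to_nan_recursive(mean_of_array), dtype=np.float64)
```

The first form produces a 1D array and the second a 2D array, which is the difference in question.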

I'd say those are distinct enough to warrant two different subclasses, where the factory chooses between _MeanMeasure and perhaps something like _NumArrayMeanMeasure. Then each ._flat_values can do its own thing and we avoid using if statements as a substitute for polymorphism.

Btw, the docstring says "Return tuple ..." and the return value here is ndarray (which I think is preferable).

This issue came up in my review of Ernesto's branch here: #251 (comment) where I explained that I think the return type of every ._flat_values should be ndarray, to avoid the unnecessary time and memory required to construct an intermediate tuple object that just gets immediately converted to an ndarray anyway.

So maybe a conflict brewing here, but that happens and Ernesto can figure it out during rebase (although it might be worth a mention to him in Slack). In any case, I think the right solution is to create the _NumArrayMeanMeasure, or whatever name is appropriate, and then adjust the factory in _Measures or wherever it is to choose the right subclass when constructing the measure. Also, these should return ndarray, and preferably all four subtypes would return ndarray, with the _BaseMeasure.raw_cube_array property modified to avoid constructing an array out of an array (which it looks like you already did).

Also the docstring for _WeightedCountMeasure._flat_values also says tuple.

I'd say those are distinct enough to warrant two different subclasses, where the factory chooses between _MeanMeasure and perhaps something like _NumArrayMeanMeasure. Then each ._flat_values can do its own thing and we avoid using if statements as a substitute for polymorphism.

Just saw your new set of comments, but the problem I'm hitting as I thought about this is that I don't see a good way to distinguish numeric arrays from non-arrays at the time of the factory. Did you have thoughts on that?

My first idea was to use _BaseMeasure._shape, but the shape of a CAT (N categories) X Numeric is the same as that of a Numeric Array (N subvariables).

The other areas where the code makes decisions about numeric arrays (e.g., _numeric_array_dimension) are in the Cube class, but I don't see a good way to pass that information along when creating the measures.

gergness · 2021-03-03T17:16:47Z

src/cr/cube/cube.py

@@ -798,9 +810,6 @@ def _flat_values(self):
        if valid_counts:
            return np.array(valid_counts, dtype=np.float64)

-        if result["counts"] == [None]:


Also not sure about this (I had previously introduced this line to avoid having to change the behavior of cube_part::Nub::is_empty - see change below).

This felt cleaner to me because it makes the return type consistent.

Hmm, a list containing the single value None seems odd. Do we have JSON that looks like "counts": [null] or something? That would be ugly indeed, as we could choose either null or an empty array and surely one of those would suit better.

Anyway, I'll see what I think after looking at the Nub bit, otherwise I'd say good riddance to this line :)
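For what it's worth, a quick sketch of how such a payload would behave once the special case is gone. The JSON fragment is hypothetical; whether real responses contain it is exactly the open question:

```python
import json

import numpy as np

# Hypothetical fragment, not taken from a real cube response.
result = json.loads('{"counts": [null]}')

# JSON null becomes Python None, which is what the removed line matched.
is_single_none = result["counts"] == [None]

# With an explicit float dtype, NumPy converts None to NaN.
counts = np.array(result["counts"], dtype=np.float64)
```

So without the special case the value is still a float64 ndarray, just one holding a single NaN.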

scanny

@gergness a few comments I wanted to send your way before breaking for lunch, I'll continue when I get back.

scanny · 2021-03-03T22:10:02Z

src/cr/cube/cube.py

-        raw_cube_array = np.array(self._flat_values).flatten().reshape(self._shape)
+        raw_cube_array = self._flat_values.flatten().reshape(self._shape)


Right on. Great minds think alike I suppose :)
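A small sketch of why dropping the wrapper matters, using made-up values: `np.array()` on an existing ndarray copies by default, so the old line paid for one extra copy before `.flatten()` made its own.

```python
import numpy as np

flat_values = np.arange(6, dtype=np.float64)
shape = (2, 3)

# np.array() on an ndarray copies by default, so this buffer is new.
extra_copy = np.array(flat_values)

# Old and new spellings give the same result; the new one skips that copy.
old_result = np.array(flat_values).flatten().reshape(shape)
new_result = flat_values.flatten().reshape(shape)
```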

scanny · 2021-03-03T22:12:47Z

src/cr/cube/cube.py

-        """Return tuple of mean values as found in cube response.
+        """Return ndarray of np.float64 values as found in cube response.


It's my habit, and perhaps more proper, to leave out the "Return " prefix on docstrings of properties (lazy or otherwise) since they behave like attributes. So we document a property like we would an attribute, by stating what it is rather than what it "returns", since returning a value is a characteristic of a function and attributes just "are" their value.

As a side benefit, it opens up first-docstring-line real estate for more descriptive words.
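An illustration of the convention on a made-up class (the `Measure` class and its members are hypothetical): the property docstring states what the value is, while the method docstring keeps "Return".

```python
class Measure:
    """Hypothetical class used only to illustrate the docstring style."""

    def __init__(self, values):
        self._values = tuple(values)

    @property
    def count(self):
        """int number of values in this measure."""
        return len(self._values)

    def scaled(self, factor):
        """Return list of values multiplied by `factor`."""
        return [v * factor for v in self._values]
```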

scanny · 2021-03-03T22:18:05Z

src/cr/cube/cube.py

+        """np.ndarray of np.float64 counts before weighting or None, if unavailable.
+
+        Use floats to avoid int overflow bugs and so we can use nan.
+        """


One convention I've adopted, which is consistent with Python 3 type hints, is to prefix a return value that can also be None with "Optional ". So this would become "Optional 1D np.float64 ndarray of counts before weighting."

Then in the body of the docstring you can mention the circumstances under which it is None, like "This value is None when the cube-result does not contain a cube-counts measure."
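A sketch of that docstring style on an invented class. `CountMeasure` and its flat payload dict are assumptions for illustration, not the real measure classes:

```python
from typing import Optional

import numpy as np


class CountMeasure:
    """Hypothetical measure wrapping an invented payload layout."""

    def __init__(self, payload: dict):
        self._payload = payload

    @property
    def flat_values(self) -> Optional[np.ndarray]:
        """Optional 1D np.float64 ndarray of counts before weighting.

        This value is None when the payload contains no "counts" entry.
        """
        counts = self._payload.get("counts")
        return None if counts is None else np.array(counts, dtype=np.float64)
```

The first line carries the type, and the body explains when None occurs.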

Hm, but this was mislabelled because it never returns None, right?

scanny · 2021-03-03T22:26:04Z

src/cr/cube/cube.py

+        if valid_counts:
+            return np.array(valid_counts, dtype=np.float64)
+
+        return np.array(result["counts"], dtype=np.float64)


It's a very fine point, but just in case you're interested, I would generally phrase this as:

    return (
        np.array(valid_counts, dtype=np.float64)
        if valid_counts
        else np.array(result["counts"], dtype=np.float64)
    )

I'm not entirely sure why I like this better, except that as a reader you don't have to reason about two separate statements and maybe any unhandled "fall-through" cases or something; it's a straight-line execution path that says "okay, we're returning a value now, there's no two ways about it, it's just a question of what that value is going to be."

Anyway do it the way you like best, just wanted to share my perspective on that.

scanny · 2021-03-03T22:40:43Z

src/cr/cube/cube.py

-        return tuple(self._cube_dict["result"]["measures"]["count"]["data"])
+        return np.array(
+            self._cube_dict["result"]["measures"]["count"]["data"], dtype=np.float64
+        )


The docstring still says tuple but this is 1D np.float64 ndarray now. But more importantly, I think this value needs to be optional now, like it is None if there is no measures.count field in the JSON.

Hmm, I suppose that's handled by the _Measures.is_weighted property that checks for this in the payload ... but that's awfully far away. Hmm, the broader structure of responsibilities is a little borked up it looks like, with like _Measures.means also checking the payload first.

I'd be inclined to think that ._flat_values should be optional for all measures that are sometimes not there, which I believe is all measures. Then if we want to know whether it has a measure we check whether that measure is None and we leave the payload interpretation to the actual measure class that can fully encapsulate it.

Not sure if you feel up to that refactoring just now, but I'd say at least that this ._flat_values should be optional and maybe fix the spot in _Measures.is_weighted that inspects the payload for itself to rely on a value from _WeightedCountMeasure rather than poke around in the JSON. I suppose that means that _BaseMeasure.raw_cube_array would need to be optional too.

Hmm, this might require a little more noodling, and I don't want to launch you into a quagmire, but maybe you can have a think on it and we can spend a cafe minute on it or something. Either way, it can't hurt to start by making this optional.

Okay, after reflection and nourishment, here's what I think:

All ._flat_values are type Optional 1D float64 ndarray (maybe numeric array is 2D or something, I don't understand that enough to say, but ndarray anyway). The value is None when there is no such cube-measure in the payload or the values it needs are otherwise not to be found. This way, only the measure objects need to know where to go in the payload to get their values and all other measure-related payload inspection can be eliminated.

_BaseMeasure.raw_cube_array is also optional. Since this is the main interface method of a measure object (and ._flat_values is private), it being None is how any higher-level unit figures out whether that base measure exists in the payload.

If we still need _Measures.weighted_counts to automatically substitute _UnweightedCountMeasure when the cube is unweighted, then we give _Measures a private ._weighted_counts property which is either a _WeightedCountMeasure or None and let the interface property .weighted_counts perform the test and substitution, like return self._weighted_counts if self._weighted_counts is not None else self.unweighted_counts.

Overall, it still smells a little fishy, because I think that auto-substitution thing probably has to go, but this would at least remove part of the long-distance coupling and leakage of responsibilities.

Probably even better would be to add a .load() classmethod to _BaseMeasure and that is called for construction purposes rather than the subclass itself. That way it can return None if the measure cannot be created and callers don't have to test .raw_cube_array to see if their reference is valid or just an empty shell.

Let me know if this sounds like more than you bargained for and I can add a few-commit branch after yours merges. But it's not untypical that modules need some cleaning up before adding new features, so it wouldn't be a bad exercise to get this one tidied up a bit. This module mostly goes back to 2019 so it's a little bit crusty around the edges.

Okay, I think I got this now

scanny

Okay, this is all the time I have for this today. I'll see if I can make it through the tests tomorrow.

scanny · 2021-03-04T00:14:10Z

src/cr/cube/cube.py

-        return tuple(self._cube_dict["result"]["measures"]["count"]["data"])
+        return np.array(
+            self._cube_dict["result"]["measures"]["count"]["data"], dtype=np.float64
+        )


src/cr/cube/cubepart.py

src/cr/cube/dimension.py

src/cr/cube/matrix/assembler.py

src/cr/cube/matrix/subtotals.py

src/cr/cube/stripe/insertion.py

scanny

Okay, LGTM, very nice job :)

gergness requested a review from slobodan-ilic February 19, 2021 02:58

slobodan-ilic reviewed Feb 19, 2021

slobodan-ilic reviewed Feb 19, 2021

gergness commented Feb 20, 2021

gergness commented Feb 22, 2021

gergness force-pushed the subdiff-v2-176856023 branch 2 times, most recently from 8a645df to 2a148a8 Compare February 23, 2021 19:24

gergness requested a review from scanny February 23, 2021 19:30

scanny suggested changes Feb 23, 2021

scanny force-pushed the new-architecture-strand-176841194 branch from a9e346e to 371c42f Compare March 2, 2021 00:14

gergness force-pushed the subdiff-v2-176856023 branch from 4dd7d38 to c2e0369 Compare March 2, 2021 20:50

gergness changed the base branch from new-architecture-strand-176841194 to master March 2, 2021 21:17

gergness added 18 commits March 2, 2021 18:29

Add integration tests for subtotal diffs

941379c

add subtrahend_ids

a87a07c

add subtrahend_idxs

2864497

stripe.SumSubtotals now adds subtrahends

f9bfcf3

can now subtract subrahends when necessary too

1191278

use SumDiffSubtotal

498b7e5

matrix.SumSubtotal is now subtrahend aware

7aac0dd

tablestderr is difference-aware

d598472

Zscore is subtrahend aware

af53518

add matrix.SumDiffSubtotals

9405231

use subtrahends in matrix

a8acce4

Oops need to subdiff in MR margin

b28bf1e

model row_proportions after column_proportions

b625fd6

use assembler.row_proportion

b88bcdc

intersection of subdiffs should be undefined?

690aa52

a fix for changing from int to float

577d31e

integration test for corresponding dim prop is undefined for subdiffs

cf8568a

SumDiffSubtotal can now override row or column

e9a6b90

override corresponding dim proportion

6b71fc1

gergness force-pushed the subdiff-v2-176856023 branch from be4739b to a7fb849 Compare March 3, 2021 00:29

gergness added 7 commits March 3, 2021 11:06

start of steve comments

9863c72

refactor integration tests

67b5fa1

convert to float at source

46d33f1

subtotals no longer claim to control dtype

faa25c4

fix test broken during rebase

b214681

All _flat_values are np.array

2ecae53

update docstrings to remove outdated np.int references

f9e0b2a

gergness force-pushed the subdiff-v2-176856023 branch from 9f7024e to f9e0b2a Compare March 3, 2021 17:09

gergness commented Mar 3, 2021

gergness requested a review from scanny March 3, 2021 17:18

scanny suggested changes Mar 3, 2021

scanny suggested changes Mar 4, 2021

gergness added 2 commits March 4, 2021 10:14

respond to easy part of second review

4830e2b

refactor where cube determines if means are available

cdc0842

gergness requested a review from scanny March 4, 2021 21:18

refactor where weighted counts are optional

dbc4e70

gergness force-pushed the subdiff-v2-176856023 branch from 6c395e3 to dbc4e70 Compare March 4, 2021 21:19

oops I guess this wasn't needed

e43aba4

scanny approved these changes Mar 5, 2021

gergness merged commit 918e9fe into master Mar 5, 2021

ernestoarbitrio deleted the subdiff-v2-176856023 branch December 20, 2022 09:49
