Pairwise comparisons #127

slobodan-ilic · 2019-01-14T20:39:20Z

No description provided.

coveralls · 2019-01-14T20:46:14Z

Coverage remained the same at 100.0% when pulling 05fadb2 on pairwise-comparisons into e75b7ef on master.

malecki · 2019-01-15T14:45:42Z

src/cr/cube/distributions/wishart.py

+
+def wishartCDF(X, n_min, n_max):
+    w_p = np.zeros(X.shape)
+    for i, x in np.ndenumerate(X):


We could evaluate only the lower triangle and copy. As far as I can tell, this is still sufficiently fast even while doing more than double the work. Consider benchmarking.
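A minimal sketch of the lower-triangle idea, assuming the input matrix is square and symmetric so each per-element result can be mirrored. `fill_symmetric` and `elementwise` are hypothetical names used for illustration; they are not part of cr.cube, and the real per-element CDF evaluation is not reproduced here.

```python
import numpy as np


def fill_symmetric(X, elementwise):
    """Apply *elementwise* to the lower triangle of square *X* and mirror it.

    Assumes *X* is square and symmetric, so result[i, j] == result[j, i].
    """
    result = np.zeros(X.shape)
    rows, cols = np.tril_indices(X.shape[0])  # lower triangle, diagonal included
    for i, j in zip(rows, cols):
        value = elementwise(X[i, j])
        result[i, j] = value
        result[j, i] = value  # copy into the upper triangle
    return result
```

For an n x n input this calls `elementwise` n(n + 1)/2 times instead of n² times, which is the saving worth benchmarking.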

malecki · 2019-01-15T14:47:18Z

src/cr/cube/distributions/wishart.py

+        }
+    }
+    """
+    size = n_min + (n_min % 2)


There are 4 commonly encountered special cases that I also don’t do here.

scanny

Okay, here's some feedback :)

scanny · 2019-01-16T18:37:30Z

src/cr/cube/crunch_cube.py

@@ -713,6 +713,18 @@ def zscore(self, weighted=True, prune=False, hs_dims=None):
        res = [s.zscore(weighted, prune, hs_dims) for s in self.slices]
        return np.array(res) if self.ndim == 3 else res[0]

+    def pairwise_pvals(self, axis=0):
+        """Return list of square ndarrays of pairwise Chi-square along axis.


The Python library documentation convention is to format parameter names in the docstring as italics (surrounded by an asterisk on each end) like *axis*. I find that a useful convention and you've probably seen it in my docstrings. I find it useful because it clearly sets off the role of call parameters.

scanny · 2019-01-16T18:39:38Z

src/cr/cube/crunch_cube.py

+        """Return list of square ndarrays of pairwise Chi-square along axis.
+
+        Square, symmetric matrix along *axis* of pairwise p-values for the
+        null hypothesis that col[i] = col[j] for each pair of columns.


This could be better worded, starting with making it a complete sentence, perhaps: "The return value is a square, symmetrical matrix ...". Always good to mention what it's useful for too, like "suitable for use as an input to a Chi-squared calculation" or whatever it's good for.

scanny · 2019-01-16T18:44:22Z

src/cr/cube/crunch_cube.py

+
+        :param axis: axis along which to perform comparison. Only columns (0)
+        are implemented currently.
+        :returns pvals: list of symmetric matrices of column-comparison p-values.


I find this Google parameter listing format noisy and that it encourages perfunctory documentation. I notice it's seldom used in the Python library documentation. You've already characterized the return value well and the axis parameter can be described in a concise sentence like "axis is the int index of the underlying slice ndarray axis along which to perform the comparison."
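One way the docstring might read with these suggestions applied, as a sketch only. `CubeSketch` is a hypothetical stand-in class, and the wording is assembled from the diff and the comments above rather than taken from the merged code.

```python
class CubeSketch(object):
    """Hypothetical stand-in for the cube class, used only to show docstring style."""

    def pairwise_pvals(self, axis=0):
        """Return list of square ndarrays of pairwise p-values along *axis*.

        Each item is a square, symmetric matrix of p-values for the null
        hypothesis that col[i] = col[j] for each pair of columns. *axis* is
        the int index of the underlying slice ndarray axis along which to
        perform the comparison. Only columns (0) are implemented currently.
        """
        raise NotImplementedError("docstring illustration only")
```

The parameter is described in a sentence and set off with asterisks, so no separate `:param:` listing is needed.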

scanny · 2019-01-16T18:46:33Z

src/cr/cube/cube_slice.py

+
+        :param axis: axis along which to perform comparison. Only columns (0)
+        are implemented currently.
+        :returns pvals: square, symmetric matrix of column-comparison p-values.


This docstring can be improved the same way, perhaps by copying with a tweak or two from the one in PairwisePvalues.

scanny · 2019-01-16T18:47:17Z

src/cr/cube/cube_slice.py

    def pvals(self, weighted=True, prune=False, hs_dims=None):
-        """Return 2D ndarray with calculated p-vals.
+        """Return 2D ndarray with calculated P values


Full sentences get a period at the end.

scanny · 2019-01-16T18:53:15Z

src/cr/cube/distributions/wishart.py

+
+
+def wishartCDF(X, n_min, n_max):
+    w_p = np.zeros(X.shape)


If you're going to use terse variable names, perhaps to remain consistent with the mathematical formula, then place the mathematical formula in (an extended) comment along with a high-level explanation of the calculation and links to more detail. Otherwise this is pure magic and asks for an unreasonable amount of research pre-work from maintainers. This leads to it being glossed over which gives bugs a place to hide.

scanny · 2019-01-16T19:00:17Z

src/cr/cube/distributions/wishart.py

+            b -= q[i + j] / (g[i] * g[j + 1])
+            A[j + 1, i] = p[i] * p[j + 1] - 2 * b
+            A[i, j + 1] = -A[j + 1, i]
+    return np.sqrt(det(A))


To me, this thing screams out for implementation as a value object, which would encapsulate all these closely-related functions into a coherent class that was naturally understood as a single piece. It also would allow you to make most of these functions lazyproperties and would give you a handy place to extract a few more methods and/or properties. This function in particular looks too big to me. I can't really understand it because the notation is so terse and is not introduced in comments. But just given its size I'm inclined to think this function is doing well more than one thing. Perhaps its implementation could become just return np.sqrt(det(self._A_vector)) or something and then the implementation of ._A_vector become a few lines that suitably arranged the interaction of further extracted small units that were each independently testable. If those units are given good names, then suddenly the complex calculation becomes much more accessible to the uninitiated.
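A rough sketch of what the final step could look like as a small value object. `PfaffianSketch` is a hypothetical name, and `functools.cached_property` (Python 3.8+) stands in for the project's `lazyproperty`. How the skew-symmetric matrix is built from the paper's quantities is deliberately left out; the sketch assumes it is handed in ready-made.

```python
from functools import cached_property  # stand-in for the project's lazyproperty

import numpy as np


class PfaffianSketch(object):
    """Value object for the absolute Pfaffian of a skew-symmetric matrix."""

    def __init__(self, skew_matrix):
        # Assumed to be skew-symmetric (A.T == -A) with even dimension.
        self._skew_matrix = np.asarray(skew_matrix, dtype=float)

    @cached_property
    def value(self):
        """float, sqrt(det(A)); equals |Pf(A)| because Pf(A)**2 == det(A)."""
        return np.sqrt(np.linalg.det(self._skew_matrix))
```

With the determinant step isolated like this, it can be tested against small matrices whose Pfaffian is known by hand, independently of the code that assembles the matrix.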

scanny · 2019-01-16T19:04:11Z

src/cr/cube/distributions/wishart.py

+    return np.sqrt(det(A))
+
+
+def _normalizing_const(n_min, n_max):


Every module, class, method, property and function requires a docstring consistent with PEP 257. This is part of your bare minimum obligation to a maintainer (which of course is most likely yourself).

scanny · 2019-01-16T19:10:43Z

src/cr/cube/measures/pairwise_pvalues.py

+        """zero based int representing the smaller of the two cube's dimension."""
+        return min(self.slice.get_shape()) - 1
+
+    def pairwise_chisq(self, axis=0, weighted=True):


Overall I like this class, although I wonder why axis and weighted can't be made construction-time arguments and make the entire interface lazyproperties. This is the characteristic of a value object; you create it and then you can read its values (no methods).

This decision is made based on expected usage, but a value object is inherently simpler to reason about and easier to understand. I'm strongly inclined to use them whenever possible and I generally find it possible. In the end the question is what the calling protocol will need to look like in the calling code, which is this class's reason for being.

scanny · 2019-01-16T19:14:27Z

src/cr/cube/measures/pairwise_pvalues.py

+
+    def __init__(self, slice_, axis=0):
+        self.slice = slice_
+        self.axis = axis


ALL instance variables should be private, especially in a value object. If you leave them public, that encourages a caller/user to access them and also allows them to mutate them, which breaks the value-integrity of the object. I have not found a good reason to ever have a public instance attribute.

While you're at it, only actually intended interface properties should be public (lack a leading underscore). If any of the lazyproperties below are just for internal use, such as serving interim calculations, make them private too. Private until proven public is a good policy in my experience.
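To make the suggestion concrete, here is a minimal sketch of a value object with private instance variables and a construction-time *axis*. `PairwiseSketch` and its `counts` argument are hypothetical, `functools.cached_property` (Python 3.8+) stands in for the project's `lazyproperty`, and the statistics computed are placeholders rather than the real pairwise calculation.

```python
from functools import cached_property  # stand-in for the project's lazyproperty

import numpy as np


class PairwiseSketch(object):
    """Value object: construct it once, then only read its properties."""

    def __init__(self, counts, axis=0):
        # Private instance variables; callers are not meant to touch these.
        self._counts = np.asarray(counts, dtype=float)
        self._axis = axis

    @cached_property
    def proportions(self):
        """2D ndarray of counts normalized along the construction-time axis."""
        return self._counts / self._counts.sum(axis=self._axis, keepdims=True)

    @cached_property
    def _n_min(self):
        """int, zero-based size of the smaller dimension (internal use only)."""
        return min(self._counts.shape) - 1
```

Calling code then reads `PairwiseSketch(counts, axis=0).proportions` with no method arguments, which is the calling protocol the value-object approach is aiming for.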

slobodan-ilic · 2019-01-17T08:21:56Z

src/cr/cube/measures/pairwise_pvalues.py

+
+    @lazyproperty
+    def _wts(self):
+        # TODO: @mike - come up with a better docstring and property name


@malecki This property needs a better name and docstring (i.e. this is just a placeholder that I came up with based on the original name of the variable from within the function).

slobodan-ilic · 2019-01-17T08:26:34Z

src/cr/cube/measures/pairwise_pvalues.py

+        return self._slice.proportions(axis=self._axis)
+
+    @lazyproperty
+    def _n_max(self):


@malecki I'd like better names for both of these. This is terse. Would something like max_size, or length (although length is highly ambiguous)... We can live with this, but I'd like it to be better :)

Gonna resist this one: I already changed the paper’s $n_{mat}$ to the more obvious size, but n_min and n_max are used in nearly every part, and breaking that connection to Chiani (2014) makes it harder to read, not easier.

* Do more work in PairwisePvalues and less in cube
* Take slice as a parameter, instead of calculated stats
* Don't raise axis error from PairwisePvalues, only do that at the top level (from the slice, and consequently the cube)
* Move properties from CubeSlice to PairwisePvalues where possible
* Bring coverage back to 100%
* Make as much functionality in PairwisePvalues (as possible) a lazyproperty
* Make class member fields private
* Normalize PairwisePvalues
* Implement WishartCDF as a value object
* Extract some functionality to lazyproperties
* Add a class Pfaffian, for better composition
* Insert links to the PDF of the paper that describes the algorithms
* Insert pylint's directive for ignoring invalid-name at module level
* Keep mathematical names for properties, with adequate docstrings

slobodan-ilic force-pushed the pairwise-comparisons branch from b8741a9 to 501e296 Compare January 15, 2019 14:42

malecki reviewed Jan 15, 2019