Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pairwise comparisons #127

Merged
merged 12 commits into from
Jan 18, 2019
Merged

Pairwise comparisons #127

merged 12 commits into from
Jan 18, 2019

Conversation

slobodan-ilic
Copy link
Contributor

No description provided.

@coveralls
Copy link

coveralls commented Jan 14, 2019

Coverage Status

Coverage remained the same at 100.0% when pulling 05fadb2 on pairwise-comparisons into e75b7ef on master.


def wishartCDF(X, n_min, n_max):
w_p = np.zeros(X.shape)
for i, x in np.ndenumerate(X):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could only evaluate the lower triangle and copy. this is still, as far as I can tell, sufficiently fast even doing more than double the work. consider benchmarking.

}
}
"""
size = n_min + (n_min % 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 4 commonly encountered special cases that I also don’t do here.

src/cr/cube/distributions/wishart.py Outdated Show resolved Hide resolved
src/cr/cube/distributions/wishart.py Outdated Show resolved Hide resolved
src/cr/cube/distributions/wishart.py Outdated Show resolved Hide resolved
src/cr/cube/distributions/wishart.py Outdated Show resolved Hide resolved
src/cr/cube/cube_slice.py Outdated Show resolved Hide resolved
@slobodan-ilic slobodan-ilic force-pushed the pairwise-comparisons branch 3 times, most recently from dd504f5 to 99f6076 Compare January 16, 2019 17:18
Copy link
Contributor

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here's some feedback :)

@@ -713,6 +713,18 @@ def zscore(self, weighted=True, prune=False, hs_dims=None):
res = [s.zscore(weighted, prune, hs_dims) for s in self.slices]
return np.array(res) if self.ndim == 3 else res[0]

def pairwise_pvals(self, axis=0):
"""Return list of square ndarrays of pairwise Chi-square along axis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python library documentation convention is to format parameter names in the docstring as italics (surrounded by asterisk on each end) like *axis*. I find that a useful convention and you've probably seen it in my docstrings. I find it useful because it clearly sets off the role of call parameters.

"""Return list of square ndarrays of pairwise Chi-square along axis.

Square, symmetric matrix along *axis* of pairwise p-values for the
null hypothesis that col[i] = col[j] for each pair of columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be better worded, starting with making it a complete sentence, perhaps: "The return value is a square, symmetrical matrix ...". Always good to mention what it's useful for too, like "suitable for use as an input to a Chi-squared calculation" or whatever it's good for.


:param axis: axis along which to perform comparison. Only columns (0)
are implemented currently.
:returns pvals: list of symmetric matrices of column-comparison p-values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this Google parameter listing format noisy and that it encourages perfunctory documentation. I notice it's seldom used in the Python library documentation. You've already characterized the return value well and the axis parameter can be described in a concise sentence like "axis is the int index of the underlying slice ndarray axis along which to perform the comparison."


:param axis: axis along which to perform comparison. Only columns (0)
are implemented currently.
:returns pvals: square, symmetric matrix of column-comparison p-values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring can be improved the same way, perhaps by copying with a tweak or two from the one in PairwisePvalues.

def pvals(self, weighted=True, prune=False, hs_dims=None):
"""Return 2D ndarray with calculated p-vals.
"""Return 2D ndarray with calculated P values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full sentences get a period at the end.



def wishartCDF(X, n_min, n_max):
w_p = np.zeros(X.shape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to use terse variable names, perhaps to remain consistent with the mathematical formula, then place the mathematical formula in (an extended) comment along with a high-level explanation of the calculation and links to more detail. Otherwise this is pure magic and asks for an unreasonable amount of research pre-work from maintainers. This leads to it being glossed over which gives bugs a place to hide.

b -= q[i + j] / (g[i] * g[j + 1])
A[j + 1, i] = p[i] * p[j + 1] - 2 * b
A[i, j + 1] = -A[j + 1, i]
return np.sqrt(det(A))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this thing screams out for implementation as a value object, which would encapsulate all these closely-related functions into a coherent class that was naturally understood as a single piece. It also would allow you to make most of these functions lazyproperties and would give you a handy place to extract a few more methods and/or properties. This function in particular looks too big to me. I can't really understand it because the notation is so terse and is not introduced in comments. But just given its size I'm inclined to think this function is doing well more than one thing. Perhaps it's implementation could become just return np.sqrt(det(self._A_vector)) or something and then the implementation of ._A_vector become a few lines that suitably arranged the interaction of further extracted small units that were each independently testable. If those units are given good names, then suddenly the complex calculation becomes much more accessible to the uninitiated.

return np.sqrt(det(A))


def _normalizing_const(n_min, n_max):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every module, class, method, property and function requires a docstring consistent with PEP 257. This is part of your bare minimum obligation to a maintainer (which of course is most likely yourself).

"""zero based int representing the smaller of the two cube's dimension."""
return min(self.slice.get_shape()) - 1

def pairwise_chisq(self, axis=0, weighted=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like this class, although I wonder why axis and weighted can't be made construction-time arguments and make the entire interface lazyproperties. This is the characteristic of a value object; you create it and then you can read its values (no methods).

This decision is made based on expected usage, but a value object is inherently simpler to reason about and easier to understand. I'm strongly inclined to use them whenever possible and I generally find it possible. In the end the question is what the calling protocol will need to look like in the calling code, which is this class's reason for being.


def __init__(self, slice_, axis=0):
self.slice = slice_
self.axis = axis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALL instance variables should be private, especially in a value object. If you leave them public, that encourages a caller/user to access them and also allows them to mutate them, which breaks the value-integrity of the object. I have not found a good reason to ever have a public instance attribute.

While you're at it, only actually intended interface properties should be public (lack a leading underscore). If any of the lazyproperties below are just for internal use, like serve interim calculation purposes, make them private too. Private until proven public is a good policy in my experience.

@slobodan-ilic slobodan-ilic force-pushed the pairwise-comparisons branch 2 times, most recently from 379db8f to 2c464aa Compare January 17, 2019 08:19

@lazyproperty
def _wts(self):
# TODO: @mike - come up with a better docstring and property name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malecki This property needs a better name and docstring (i.e. this is just a placeholder that I came up with based on the original name of the variable from within the function).

return self._slice.proportions(axis=self._axis)

@lazyproperty
def _n_max(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malecki I'd like better names for both of these. This is terse. Would something like max_size, or length (although length is highly ambiguous)... We can live with this, but I'd like it to be better :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gonna resist this one: I already changed the paper’s $ n_{mat} $ to the more obvious size but the n_min and n_max are used in nearly every part and breaking that connection to Chiani (2014) makes it harder to read not easier.

malecki and others added 2 commits January 18, 2019 15:37
* Do more work in PairwisePvalues and less in cube
* Take slice as a parameter, instead of calculated stats
* Don't raise axis error from PairwisePvalues, only do that at the top
level (from the slice, and consequently the cube)
* Move properties from CubeSlice to PairwisePvalues where possible
* Bring coverage back to 100%

* Make as much functionality in PairwisePvalues (as possible) a lazyproperty
* Make class member fields private
* Normalize PairwisePvalues

* Implement WishartCDF as a value object
* Extract some functionality to lazyproperties
* Add a class Pfaffian, for better composition

* Insert links to the PDF of the paper that describes the algorithms
* Insert pylint's directive for ignoring invalid-name at module level
* Keep mathematical names for properties, with adequate docstrings
@slobodan-ilic slobodan-ilic merged commit 6056085 into master Jan 18, 2019
@ernestoarbitrio ernestoarbitrio deleted the pairwise-comparisons branch August 31, 2020 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants