-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pairwise comparisons #127
Pairwise comparisons #127
Conversation
b8741a9
to
501e296
Compare
src/cr/cube/distributions/wishart.py
Outdated
|
||
def wishartCDF(X, n_min, n_max): | ||
w_p = np.zeros(X.shape) | ||
for i, x in np.ndenumerate(X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could only evaluate the lower triangle and copy. this is still, as far as I can tell, sufficiently fast even doing more than double the work. consider benchmarking.
src/cr/cube/distributions/wishart.py
Outdated
} | ||
} | ||
""" | ||
size = n_min + (n_min % 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are 4 commonly encountered special cases that I also don’t do here.
dd504f5
to
99f6076
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, here's some feedback :)
src/cr/cube/crunch_cube.py
Outdated
@@ -713,6 +713,18 @@ def zscore(self, weighted=True, prune=False, hs_dims=None): | |||
res = [s.zscore(weighted, prune, hs_dims) for s in self.slices] | |||
return np.array(res) if self.ndim == 3 else res[0] | |||
|
|||
def pairwise_pvals(self, axis=0): | |||
"""Return list of square ndarrays of pairwise Chi-square along axis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Python library documentation convention is to format parameter names in the docstring as italics (surrounded by asterisk on each end) like *axis*
. I find that a useful convention and you've probably seen it in my docstrings. I find it useful because it clearly sets off the role of call parameters.
"""Return list of square ndarrays of pairwise Chi-square along axis. | ||
|
||
Square, symmetric matrix along *axis* of pairwise p-values for the | ||
null hypothesis that col[i] = col[j] for each pair of columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be better worded, starting with making it a complete sentence, perhaps: "The return value is a square, symmetrical matrix ...". Always good to mention what it's useful for too, like "suitable for use as an input to a Chi-squared calculation" or whatever it's good for.
src/cr/cube/crunch_cube.py
Outdated
|
||
:param axis: axis along which to perform comparison. Only columns (0) | ||
are implemented currently. | ||
:returns pvals: list of symmetric matrices of column-comparison p-values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this Google parameter listing format noisy and that it encourages perfunctory documentation. I notice it's seldom used in the Python library documentation. You've already characterized the return value well and the axis parameter can be described in a concise sentence like "axis is the int index of the underlying slice ndarray axis along which to perform the comparison."
src/cr/cube/cube_slice.py
Outdated
|
||
:param axis: axis along which to perform comparison. Only columns (0) | ||
are implemented currently. | ||
:returns pvals: square, symmetric matrix of column-comparison p-values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring can be improved the same way, perhaps by copying with a tweak or two from the one in PairwisePvalues
.
def pvals(self, weighted=True, prune=False, hs_dims=None): | ||
"""Return 2D ndarray with calculated p-vals. | ||
"""Return 2D ndarray with calculated P values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full sentences get a period at the end.
src/cr/cube/distributions/wishart.py
Outdated
|
||
|
||
def wishartCDF(X, n_min, n_max): | ||
w_p = np.zeros(X.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to use terse variable names, perhaps to remain consistent with the mathematical formula, then place the mathematical formula in (an extended) comment along with a high-level explanation of the calculation and links to more detail. Otherwise this is pure magic and asks for an unreasonable amount of research pre-work from maintainers. This leads to it being glossed over which gives bugs a place to hide.
src/cr/cube/distributions/wishart.py
Outdated
b -= q[i + j] / (g[i] * g[j + 1]) | ||
A[j + 1, i] = p[i] * p[j + 1] - 2 * b | ||
A[i, j + 1] = -A[j + 1, i] | ||
return np.sqrt(det(A)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, this thing screams out for implementation as a value object, which would encapsulate all these closely-related functions into a coherent class that was naturally understood as a single piece. It also would allow you to make most of these functions lazyproperties and would give you a handy place to extract a few more methods and/or properties. This function in particular looks too big to me. I can't really understand it because the notation is so terse and is not introduced in comments. But just given its size I'm inclined to think this function is doing well more than one thing. Perhaps it's implementation could become just return np.sqrt(det(self._A_vector))
or something and then the implementation of ._A_vector
become a few lines that suitably arranged the interaction of further extracted small units that were each independently testable. If those units are given good names, then suddenly the complex calculation becomes much more accessible to the uninitiated.
src/cr/cube/distributions/wishart.py
Outdated
return np.sqrt(det(A)) | ||
|
||
|
||
def _normalizing_const(n_min, n_max): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every module, class, method, property and function requires a docstring consistent with PEP 257. This is part of your bare minimum obligation to a maintainer (which of course is most likely yourself).
"""zero based int representing the smaller of the two cube's dimension.""" | ||
return min(self.slice.get_shape()) - 1 | ||
|
||
def pairwise_chisq(self, axis=0, weighted=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like this class, although I wonder why axis
and weighted
can't be made construction-time arguments and make the entire interface lazyproperties. This is the characteristic of a value object; you create it and then you can read its values (no methods).
This decision is made based on expected usage, but a value object is inherently simpler to reason about and easier to understand. I'm strongly inclined to use them whenever possible and I generally find it possible. In the end the question is what the calling protocol will need to look like in the calling code, which is this class's reason for being.
|
||
def __init__(self, slice_, axis=0): | ||
self.slice = slice_ | ||
self.axis = axis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ALL instance variables should be private, especially in a value object. If you leave them public, that encourages a caller/user to access them and also allows them to mutate them, which breaks the value-integrity of the object. I have not found a good reason to ever have a public instance attribute.
While you're at it, only actually intended interface properties should be public (lack a leading underscore). If any of the lazyproperties below are just for internal use, like serve interim calculation purposes, make them private too. Private until proven public is a good policy in my experience.
379db8f
to
2c464aa
Compare
|
||
@lazyproperty | ||
def _wts(self): | ||
# TODO: @mike - come up with a better docstring and property name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@malecki This property needs a better name and docstring (i.e. this is just a placeholder that I came up with based on the original name of the variable from within the function).
return self._slice.proportions(axis=self._axis) | ||
|
||
@lazyproperty | ||
def _n_max(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@malecki I'd like better names for both of these. This is terse. Would something like max_size
, or length
(although length is highly ambiguous)... We can live with this, but I'd like it to be better :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gonna resist this one: I already changed the paper’s $ n_{mat} $ to the more obvious size
but the n_min and n_max are used in nearly every part and breaking that connection to Chiani (2014) makes it harder to read not easier.
501c541
to
232c7a3
Compare
* Do more work in PairwisePvalues and less in cube * Take slice as a parameter, instead of calculated stats * Don't raise axis error from PairwisePvalues, only do that at the top level (from the slice, and consequently the cube) * Move properties from CubeSlice to PairwisePvalues where possible * Bring coverage back to 100% * Make as much functionality in PairwisePvalues (as possible) a lazyproperty * Make class member fields private * Normalize PairwisePvalues * Implement WishartCDF as a value object * Extract some functionality to lazyproperties * Add a class Pfaffian, for better composition * Insert links to the PDF of the paper that describes the algorithms * Insert pylint's directive for ignoring invalid-name at module level * Keep mathematical names for properties, with adequate docstrings
10fc471
to
05fadb2
Compare
No description provided.