Authors and pca interface #43
Conversation
Pull Request Test Coverage Report for Build 300 (Coveralls)
Think I got everything. We'll wait on @dburkhardt before merging.
This all looks great, thanks for adding this @stanleyjs! A couple of suggestions. To follow the sklearn API, passing nonpositive or non-numeric input to `rank_threshold` should throw a `ValueError`. Can you provide a reference in the docs about the `rank_threshold` estimation? I'm not familiar with this approach; do you have a citation?
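A minimal sketch of the kind of input validation suggested above (the helper name is hypothetical, not the project's actual code; raising `ValueError` on bad input follows the sklearn convention):

```python
import numbers


def check_rank_threshold(rank_threshold):
    # Hypothetical validator sketch: accept None or a positive number,
    # and raise ValueError otherwise (sklearn-style input checking).
    if rank_threshold is None:
        return None
    if not isinstance(rank_threshold, numbers.Number) or rank_threshold <= 0:
        raise ValueError(
            "rank_threshold must be a positive number or None, "
            "got %r" % (rank_threshold,)
        )
    return rank_threshold
```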
@dburkhardt I think the newest commit handles your desired interface, with a caveat. I wanted to tell users that they're not going to get any threshold when `n_pca` is not set to adaptive estimation; this is clearly not desirable, as most of the time people will set `rank_threshold` expecting it to be used.

References: I've seen this idea everywhere in the machine learning literature when people talk about low-rank matrix completion, rank estimation, low-rank factorizations, epsilon approximations, and so on. I have not personally read this book, but numpy cites "Numerical Recipes" for this threshold here: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.matrix_rank.html. I know that "Numerical Recipes" is a classic reference, and it's also the threshold that MATLAB uses by default. I'd also check out "Matrix Computations" by Golub and Van Loan. I added a reference to "Numerical Recipes" in the docstrings.

Ultimately I would like to do something more efficient than this SVD approach, but that will be for a different PR. I don't know why it's failing builds; it's working locally. I'll talk to Scott.
Thanks for the PR!
Two things:
1) A change to the PCA interface.
2) A tweak to the authorship in setup.py
1):

There are now essentially three states to `n_pca`:

- If `n_pca` is a positive integer, it behaves as before.
- If `n_pca in [0, None, False]`, no PCs are used (so it behaves like `n_pca == None` from previous versions).
- If `n_pca in [True, 'adaptive']` (including upper/lowercase permutations of `adaptive`), then `n_pca` is estimated from the rank of the data. This estimate is done by first computing all of the singular values, then keeping only the PCs above the threshold. This involves instantiating the PCA operator and then tweaking it after the fit.
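The three states could be sketched roughly as follows (a hypothetical helper for illustration, not the package's actual code):

```python
def interpret_n_pca(n_pca):
    # Hypothetical sketch of the three n_pca states described above.
    # Check the adaptive state first, since True is also an instance of int.
    if (isinstance(n_pca, str) and n_pca.lower() == "adaptive") or n_pca is True:
        return "adaptive"   # estimate n_pca from the rank of the data
    if n_pca in [0, None, False]:
        return "none"       # no PCs are used
    if isinstance(n_pca, int) and n_pca > 0:
        return n_pca        # behaves as before
    raise ValueError(
        "n_pca must be a positive integer, one of [0, None, False], "
        "True, or 'adaptive'"
    )
```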
The threshold is set with the parameter `rank_threshold`, which defaults to `rank_threshold = None`. Any nonpositive or non-numeric input to `rank_threshold` defaults to `None`. When `rank_threshold == None`, the threshold is set to (largest singular value of the data) * (machine epsilon) * max(data.shape). If `rank_threshold` is a positive number, that threshold is used directly. If `n_pca not in [True, 'adaptive']`, then `rank_threshold` is ignored (a warning is raised if `rank_threshold != None and n_pca not in [True, 'adaptive']`).

This rank thresholding is one of the proper ways to estimate the rank of a matrix. However, there are a number of other ways to do this that are quicker or have a smaller memory footprint. For one, it is possible to estimate the spectral norm of a matrix and then only compute the singular pairs above the threshold. I have not studied the available sparse/PCA modules to identify whether one allows for this iterative approach, but I know it is merely a modification of truncated approaches.
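The thresholding described above can be sketched as follows (a minimal illustration, not the package's actual code; the default threshold mirrors the `numpy.linalg.matrix_rank` / "Numerical Recipes" convention):

```python
import numpy as np


def adaptive_rank(data, rank_threshold=None):
    """Count singular values above a threshold (sketch of the adaptive mode).

    If rank_threshold is None, default to
    (largest singular value) * (machine epsilon) * max(data.shape),
    the same tolerance numpy.linalg.matrix_rank uses.
    """
    s = np.linalg.svd(data, compute_uv=False)  # all singular values
    if rank_threshold is None:
        rank_threshold = s.max() * np.finfo(data.dtype).eps * max(data.shape)
    return int(np.sum(s > rank_threshold))


rng = np.random.RandomState(0)
low_rank = rng.randn(50, 3) @ rng.randn(3, 40)  # numerically rank 3
print(adaptive_rank(low_rank))  # -> 3
```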
Another method would be to use a truncated SVD where one passes the rank in first. To do that, we need an estimate of the number of singular values that are above our threshold. numpy does this by computing all the singular values and counting them; this is akin to what we are doing, except it may be faster, as they do not compute the singular vectors. A related way would be a randomized approach to estimate the distribution of the singular values and use that to choose our rank. Finally, there is a `scipy.linalg.interpolative` module that handles a lot of these numerical tricks for us. However, I have not gotten its rank estimates to line up with what I expect from unit tests.

Regardless, I have added a bunch of tests for this new interface and functionality.
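For illustration, this is roughly how the randomized rank estimate in `scipy.linalg.interpolative` can be called (a sketch only; as noted above, its estimates do not always match the SVD-based count exactly):

```python
import numpy as np
import scipy.linalg.interpolative as sli

rng = np.random.RandomState(0)
A = rng.randn(120, 3) @ rng.randn(3, 80)  # numerically rank 3

# Randomized estimate of the rank to relative precision eps,
# without computing a full SVD of A.
r = sli.estimate_rank(A, eps=1e-8)
print(r)
```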
2)

The previous authors list for the package in `setup.py` was "Jay Stanley, @scottgigante, Krishnaswamy lab, Yale University". I changed this to "@scottgigante, @dburkhardt, Jay Stanley, Yale University".
I do not really care about the ordering here; I put it in order of my impression of who has made the most commits.

This is open source software: anyone can see who commits to this repo and how much they do. At the end of the day, I would be happy with just names without affiliations (university or otherwise). If we want or need to add affiliations, there is a readme / GitHub page where we can be more verbose.