
Authors and pca interface #43

Merged
merged 7 commits into from
Sep 5, 2019

Conversation

stanleyjs (Collaborator)

Two things:
1) A change to the PCA interface.

2) A tweak to the authorship in setup.py

1):
There are now essentially three states for n_pca.

If n_pca is a positive integer, it behaves as before.
If n_pca in [0, None, False], no PCs are used (so it behaves like n_pca == None from previous versions).
If n_pca in [True, 'Adaptive'] (and upper/lowercase permutations of 'adaptive'), then n_pca is estimated from the rank of the data.
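As a sketch, the three-state dispatch above could be parsed like this (a hypothetical helper for illustration, not the actual graphtools code):

```python
def interpret_n_pca(n_pca):
    """Classify n_pca into one of the three states described above."""
    # Check the adaptive case first: in Python, True is an instance of int,
    # so this must come before the positive-integer branch.
    if n_pca is True or (isinstance(n_pca, str) and n_pca.lower() == "adaptive"):
        return "adaptive"   # estimate n_pca from the rank of the data
    if n_pca in (0, None, False):
        return "none"       # no PCs are used
    if isinstance(n_pca, int) and n_pca > 0:
        return "fixed"      # behaves as before
    raise ValueError(
        "n_pca must be a positive int, one of [0, None, False], "
        "or one of [True, 'adaptive']"
    )
```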

This estimate is done by first computing all of the singular values, then taking only the PCs above the threshold. This involves instantiating the PCA operator and then tweaking it after the fit.

The threshold is set with the parameter rank_threshold, which defaults to rank_threshold = None. Any nonpositive or non-numeric input to rank_threshold defaults to None.
When rank_threshold == None, the threshold is set to (largest singular value of the data) * epsilon * max(data.shape).
If rank_threshold is positive numeric, that threshold is used directly. If n_pca not in [True, 'Adaptive'], then rank_threshold is ignored (a warning is raised if rank_threshold != None and n_pca not in [True, 'Adaptive']).
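A minimal NumPy sketch of the default threshold described above, assuming the (largest singular value) * epsilon * max(data.shape) tolerance (the same default NumPy's matrix_rank uses):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 50 x 20 matrix of exact rank 3 as a product of thin factors.
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))

# Compute all singular values, then keep only the PCs above the threshold.
s = np.linalg.svd(X, compute_uv=False)
threshold = s.max() * np.finfo(X.dtype).eps * max(X.shape)
n_pca = int(np.sum(s > threshold))  # estimated rank: 3 for this construction
```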

This rank thresholding is one of the proper ways to estimate the rank of a matrix. However, there are a number of other ways to do this that are quicker or have a smaller memory footprint. For one, it is possible to estimate the spectral norm of a matrix and then only compute the singular pairs above that threshold. I have not studied the available sparse/PCA modules to identify whether one allows for this iterative approach; I believe it is merely a modification of truncated approaches.

Another method would be to use a truncated SVD where one passes the rank in first. To do that we need an estimate of the number of singular values that are above our threshold. NumPy does this by just computing all the singular values and counting them. This is akin to what we are doing, except it may be faster as they are not computing the singular vectors. A related way would be a randomized approach to estimate the distribution of the singular values and use that to choose our rank. Finally, there is a scipy.linalg.interpolative module that handles a lot of these numerical tricks for us. However, I have not gotten their rank estimates to line up with what I expect from unit tests.

Regardless, I have added a bunch of tests for this new interface and functionality.

2)
The previous authors list for the package setup.py was "Jay Stanley, @scottgigante , Krishnaswamy lab, Yale University"
I changed this to " @scottgigante, @dburkhardt, Jay Stanley, Yale University"
I do not really care about the ordering here. I put it in order of my impression of most commits.
It is open source software. Anyone can see who commits to this repo and how much they do. At the end of the day I would be happy with just names without affiliations (university or otherwise). If we want/need to add affiliations there is a readme / GitHub page where we can be more verbose.


coveralls commented Sep 4, 2019

Pull Request Test Coverage Report for Build 300

  • 69 of 70 (98.57%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 94.356%

Changes missing coverage:
File | Covered Lines | Changed/Added Lines | %
graphtools/base.py | 67 | 68 | 98.53%

Totals (Coverage Status):
Change from base Build 285: +0.2%
Covered Lines: 953
Relevant Lines: 1010

💛 - Coveralls

stanleyjs (Collaborator, Author)

Think I got everything. We'll wait on @dburkhardt before merging.

dburkhardt (Member)

This all looks great, thanks for adding this @stanleyjs! A couple of suggestions.

To follow the sklearn API, we should use n_pca = 'auto' or rank_threshold = 'auto' for algorithmic selection of components/rank.

Passing in nonpositive or non-numeric input to rank_threshold should throw a ValueError instead of defaulting to None.

Can you provide a reference in the docs about the rank_threshold estimation? I'm not familiar with this approach; do you have a citation?


stanleyjs commented Sep 5, 2019

@dburkhardt I think the newest commit handles your desired interface with a caveat.

I wanted to warn users that they will not get any thresholding when rank_threshold has been set but n_pca not in ['auto', True]. However, if I set 'auto' as the default for rank_threshold, then by default rank_threshold = 'auto' and n_pca = None, so a warning would be raised by default, since the condition [rank_threshold set, n_pca not in ['auto', True]] = [True, True].

This is clearly not desirable, as most of the time people will set n_pca and not care that rank_threshold is set to 'auto'. So I can turn off the warning, or I can add some parsing logic. The latest commit does the latter. In the function declaration, rank_threshold = None. When it is parsed by Data._parse_n_pca_threshold(), if rank_threshold is not None (i.e., the user changed it) and n_pca not in ['auto', True], the warning is raised. Otherwise we continue and check whether n_pca in ['auto', True]; if it is, we promote the default None to 'auto'.
This way we check whether the user has made any changes before giving them a warning, but we still use 'auto' like a default.
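In sketch form, the parsing logic above looks something like this (illustrative names and messages; the real logic lives in Data._parse_n_pca_threshold in graphtools/base.py):

```python
import warnings

def parse_n_pca_threshold(n_pca=None, rank_threshold=None):
    """Warn only when the user actively set rank_threshold but it will be unused."""
    # n_pca is True must be checked by identity, not `in`, to avoid 1 == True.
    adaptive = n_pca is True or n_pca == "auto"
    if rank_threshold is not None and not adaptive:
        # The user changed rank_threshold from its declared default of None,
        # but no adaptive thresholding will happen.
        warnings.warn("rank_threshold is ignored unless n_pca is 'auto' or True")
    if adaptive and rank_threshold is None:
        rank_threshold = "auto"  # 'auto' acts as the effective default
    return n_pca, rank_threshold
```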

References: I've seen this idea everywhere in the machine learning literature when people talk about low rank matrix completion, rank estimation, low rank factorizations, epsilon approximations... I have not personally read this book, but NumPy cites "Numerical Recipes" here https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.matrix_rank.html

I know that "Numerical Recipes" is a classic reference. It's also the threshold that MATLAB uses by default. I'd also check out "Matrix Computations" by Golub and Van Loan. I added a reference to "Numerical Recipes" in the docstrings.
The basic idea is that SVD introduces artificially nonzero singular values when the matrix is low rank. If we had a perfect algorithm, they would be 0 and the problem would be easier. But they're not, so you have to detect them using the numerical precision of your algorithm and data types, scaled by your maximum singular value. By the way, the singular vectors that correspond to these values are orthogonal nonsense.
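To see the artifact concretely (illustrative NumPy snippet, not graphtools code): an exactly rank-2 matrix still comes back from the SVD with tiny nonzero trailing singular values.

```python
import numpy as np

rng = np.random.default_rng(42)
# An exactly rank-2 matrix, built as a product of 30x2 and 2x30 factors.
X = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))

s = np.linalg.svd(X, compute_uv=False)
# s[0] and s[1] are genuinely nonzero. In exact arithmetic s[2:] would all be
# zero, but floating point leaves them tiny-but-nonzero, on the order of
# machine epsilon times s[0].
noise_ratio = s[2] / s[0]
print(noise_ratio < 1e-12)  # True: everything past rank 2 is numerical noise
```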

Ultimately I would like to do something more efficient than this SVD approach, but that will be for a different PR.

I don't know why it's failing builds when it's working locally. I'll talk to Scott.

@stanleyjs stanleyjs merged commit 2e8d8ff into dev Sep 5, 2019
@scottgigante scottgigante deleted the authors_and_pca_interface branch September 5, 2019 20:53
scottgigante (Contributor)

Thanks for the PR!
