
Authors and pca interface #43

Merged
merged 7 commits into from
Sep 5, 2019

Conversation

stanleyjs (Collaborator)

Two things:
1) A change to the PCA interface.

2) A tweak to the authorship in setup.py

1):
There are now essentially three states for n_pca.

If n_pca is a positive integer, it behaves as before.
If n_pca in [0, None, False], no PCs are used (so it behaves like n_pca == None from previous versions).
If n_pca in [True, 'Adaptive'] (and upper/lowercase permutations of 'adaptive'), then n_pca is estimated from the rank of the data.
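As a sketch, the three-state dispatch above could be parsed like this (a hypothetical helper for illustration, not the actual graphtools code):

```python
def interpret_n_pca(n_pca):
    """Classify n_pca into one of the three states described above."""
    # Check the adaptive case first: in Python, True is an instance of int,
    # so this must come before the positive-integer branch.
    if n_pca is True or (isinstance(n_pca, str) and n_pca.lower() == "adaptive"):
        return "adaptive"   # estimate n_pca from the rank of the data
    if n_pca in (0, None, False):
        return "none"       # no PCs are used
    if isinstance(n_pca, int) and n_pca > 0:
        return "fixed"      # behaves as before
    raise ValueError(
        "n_pca must be a positive int, one of [0, None, False], "
        "or one of [True, 'adaptive']"
    )
```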

This estimate is done by first computing all of the singular values, then taking only the PCs above the threshold. This involves instantiating the PCA operator and then tweaking it after the fit.

The threshold is set with the parameter rank_threshold, which defaults to rank_threshold = None. Any nonpositive or non-numeric input to rank_threshold defaults to None.
When rank_threshold == None, the threshold is set to (largest singular value of the data) * epsilon * max(data.shape).
If rank_threshold is positive numeric, that threshold is used directly. If n_pca not in [True, 'Adaptive'], then rank_threshold is ignored (a warning is raised if rank_threshold != None and n_pca not in [True, 'Adaptive']).
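A minimal NumPy sketch of the default threshold described above, assuming the (largest singular value) * epsilon * max(data.shape) tolerance (the same default NumPy's matrix_rank uses):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 50 x 20 matrix of exact rank 3 as a product of thin factors.
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))

# Compute all singular values, then keep only the PCs above the threshold.
s = np.linalg.svd(X, compute_uv=False)
threshold = s.max() * np.finfo(X.dtype).eps * max(X.shape)
n_pca = int(np.sum(s > threshold))  # estimated rank: 3 for this construction
```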

This rank thresholding is one of the proper ways to estimate the rank of a matrix. However, there are a number of other ways to do this that are quicker or have a smaller memory footprint. For one, it is possible to estimate the spectral norm of a matrix and then only compute the singular pairs above that threshold. I have not studied the available sparse/PCA modules to identify whether one allows for this iterative approach; I believe it is merely a modification of truncated approaches.

Another method would be to use a truncated SVD where one passes the rank in first. To do that we need an estimate of the number of singular values that are above our threshold. NumPy does this by just computing all the singular values and counting them. This is akin to what we are doing, except it may be faster as they are not computing the singular vectors. A related way would be a randomized approach to estimate the distribution of the singular values and use that to choose our rank. Finally, there is a scipy.linalg.interpolative module that handles a lot of these numerical tricks for us. However, I have not gotten their rank estimates to line up with what I expect from unit tests.

Regardless, I have added a bunch of tests for this new interface and functionality.

2)
The previous authors list for the package setup.py was "Jay Stanley, @scottgigante , Krishnaswamy lab, Yale University"
I changed this to " @scottgigante, @dburkhardt, Jay Stanley, Yale University"
I do not really care about the ordering here. I put it in order of my impression of most commits.
It is open source software. Anyone can see who commits to this repo and how much they do. At the end of the day I would be happy with just names without affiliations (university or otherwise). If we want/need to add affiliations there is a readme / GitHub page where we can be more verbose.


coveralls commented Sep 4, 2019

Pull Request Test Coverage Report for Build 300

  • 69 of 70 (98.57%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 94.356%

Changes missing coverage:
File | Covered Lines | Changed/Added Lines | %
graphtools/base.py | 67 | 68 | 98.53%

Totals (Coverage Status):
Change from base Build 285: +0.2%
Covered Lines: 953
Relevant Lines: 1010

💛 - Coveralls

stanleyjs (Collaborator, Author)

Think I got everything. We'll wait on @dburkhardt before merging.

dburkhardt (Member)

This all looks great, thanks for adding this @stanleyjs! A couple of suggestions.

To follow the sklearn API, we should use n_pca = 'auto' or rank_threshold = 'auto' for algorithmic selection of components/rank.

Passing in nonpositive or non-numeric input to rank_threshold should throw a ValueError instead of defaulting to None.

Can you provide a reference in the docs about the rank_threshold estimation? I'm not familiar with this approach; do you have a citation?


stanleyjs commented Sep 5, 2019

@dburkhardt I think the newest commit handles your desired interface with a caveat.

I wanted to warn users that they will not get any thresholding when rank_threshold has been set but n_pca not in ['auto', True]. However, if I set 'auto' as the default for rank_threshold, then by default rank_threshold = 'auto' and n_pca = None, so a warning would be raised by default, since the condition [rank_threshold set, n_pca not in ['auto', True]] = [True, True].

This is clearly not desirable, as most of the time people will set n_pca and not care that rank_threshold is set to 'auto'. So I can turn off the warning, or I can add some parsing logic. The latest commit does the latter. In the function declaration, rank_threshold = None. When it is parsed by Data._parse_n_pca_threshold(), if rank_threshold is not None (i.e., the user changed it) and n_pca not in ['auto', True], the warning is raised. Otherwise we continue and check whether n_pca in ['auto', True]; if it is, we promote the default None to 'auto'.
This way we check whether the user has made any changes before giving them a warning, but we still use 'auto' like a default.
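In sketch form, the parsing logic above looks something like this (illustrative names and messages; the real logic lives in Data._parse_n_pca_threshold in graphtools/base.py):

```python
import warnings

def parse_n_pca_threshold(n_pca=None, rank_threshold=None):
    """Warn only when the user actively set rank_threshold but it will be unused."""
    # n_pca is True must be checked by identity, not `in`, to avoid 1 == True.
    adaptive = n_pca is True or n_pca == "auto"
    if rank_threshold is not None and not adaptive:
        # The user changed rank_threshold from its declared default of None,
        # but no adaptive thresholding will happen.
        warnings.warn("rank_threshold is ignored unless n_pca is 'auto' or True")
    if adaptive and rank_threshold is None:
        rank_threshold = "auto"  # 'auto' acts as the effective default
    return n_pca, rank_threshold
```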

References: I've seen this idea everywhere in the machine learning literature when people talk about low rank matrix completion, rank estimation, low rank factorizations, epsilon approximations... I have not personally read this book, but NumPy cites "Numerical Recipes" here https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.matrix_rank.html

I know that "Numerical Recipes" is a classic reference. It's also the threshold that MATLAB uses by default. I'd also check out "Matrix Computations" by Golub and Van Loan. I added a reference to "Numerical Recipes" in the docstrings.
The basic idea is that SVD introduces artificially nonzero singular values when the matrix is low rank. If we had a perfect algorithm, they would be 0 and the problem would be easier. But they're not, so you have to detect them using the numerical precision of your algorithm and data types, scaled by your maximum singular value. By the way, the singular vectors that correspond to these values are orthogonal nonsense.
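To see the artifact concretely (illustrative NumPy snippet, not graphtools code): an exactly rank-2 matrix still comes back from the SVD with tiny nonzero trailing singular values.

```python
import numpy as np

rng = np.random.default_rng(42)
# An exactly rank-2 matrix, built as a product of 30x2 and 2x30 factors.
X = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))

s = np.linalg.svd(X, compute_uv=False)
# s[0] and s[1] are genuinely nonzero. In exact arithmetic s[2:] would all be
# zero, but floating point leaves them tiny-but-nonzero, on the order of
# machine epsilon times s[0].
noise_ratio = s[2] / s[0]
print(noise_ratio < 1e-12)  # True: everything past rank 2 is numerical noise
```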

Ultimately I would like to do something more efficient than this SVD approach, but that will be for a different PR.

I don't know why it's failing builds when it's working locally. I'll talk to Scott.

@stanleyjs stanleyjs merged commit 2e8d8ff into dev Sep 5, 2019
@scottgigante scottgigante deleted the authors_and_pca_interface branch September 5, 2019 20:53
scottgigante (Contributor)

Thanks for the PR!
