[MRG] Fixes numerical issue in GMM.condition #28

AlexanderFabisch · 2021-04-03T13:04:52Z

fix nan priors
add test case that fails without this pull request
another problem: covariance may not always be positive semi-definite, take a look at Oracle Approximating Shrinkage: https://github.com/scikit-learn/scikit-learn/blob/95119c13a/sklearn/covariance/_shrunk_covariance.py#L434
test case for oracle_approximating_shrinkage in GMM.from_samples
apidoc for oracle_approximating_shrinkage
document prior computation in GMM.condition
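
For readers unfamiliar with the "nan priors" problem, here is a minimal sketch of one common way it arises and one common remedy. It assumes the NaN comes from a 0 / 0 division when every component density underflows to zero for a query point far from the data. The helper name `conditional_priors` is hypothetical and this is not necessarily how GMM.condition is implemented in this pull request.

```python
import numpy as np


def conditional_priors(priors, log_densities):
    """Posterior component weights from priors and per-component log densities.

    Hypothetical helper for illustration only; not gmr's actual code.
    """
    log_weights = np.log(priors) + log_densities
    # Subtracting the maximum keeps at least one exponent at exp(0) = 1,
    # so the normalizer cannot underflow to zero.
    log_weights = log_weights - np.max(log_weights)
    weights = np.exp(log_weights)
    return weights / weights.sum()


priors = np.array([0.5, 0.5])
# Log densities of a query point that is far away from both components.
log_densities = np.array([-800.0, -805.0])

# Naive version: exp(-800) underflows to 0.0, so normalizing gives 0 / 0.
with np.errstate(invalid="ignore"):
    naive = priors * np.exp(log_densities)
    naive = naive / naive.sum()

stable = conditional_priors(priors, log_densities)
print(naive)   # [nan nan]
print(stable)  # finite weights that sum to 1
```

Working in log space keeps the relative weights of the components intact even when the absolute densities are far too small to represent.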

AlexanderFabisch · 2021-04-04T22:57:36Z

@mralbu would be great if you could take a look at this PR.

mralbu · 2021-04-06T01:35:33Z

Will take a look as soon as I can!

mralbu · 2021-04-14T06:41:03Z

I ran a snippet similar to the one in #27 against the proposed fix, after first defining the following RegressorMixin-based class:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, MultiOutputMixin
from sklearn.utils import check_X_y
from sklearn.utils.validation import check_is_fitted, check_array, FLOAT_DTYPES

from gmr import GMM

class GaussianMixtureRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
    """scikit-learn style regressor that wraps gmr.GMM."""

    def __init__(self, n_components, priors=None, means=None, covariances=None,
                 verbose=0, random_state=None, R_diff=1e-4, n_iter=500,
                 init_params="random", oracle_approximating_shrinkage=False):
        self.n_components = n_components
        self.priors = priors
        self.means = means
        self.covariances = covariances
        self.verbose = verbose
        self.random_state = random_state
        self.R_diff = R_diff
        self.n_iter = n_iter
        self.init_params = init_params
        self.oracle_approximating_shrinkage = oracle_approximating_shrinkage

    def fit(self, X, y):
        self.gmm_ = GMM(self.n_components, priors=self.priors, means=self.means,
                        covariances=self.covariances, verbose=self.verbose,
                        random_state=self.random_state)

        X, y = check_X_y(X, y, estimator=self.gmm_, dtype=FLOAT_DTYPES, multi_output=True)
        if y.ndim == 1:
            y = np.expand_dims(y, 1)

        # Input features occupy the first X.shape[1] columns of the joint data.
        self.indices_ = np.arange(X.shape[1])

        # Fit the joint distribution p(x, y) with EM.
        self.gmm_.from_samples(
            np.hstack((X, y)), R_diff=self.R_diff, n_iter=self.n_iter,
            init_params=self.init_params,
            oracle_approximating_shrinkage=self.oracle_approximating_shrinkage)
        return self

    def predict(self, X):
        check_is_fitted(self, ["gmm_", "indices_"])
        X = check_array(X, estimator=self.gmm_, dtype=FLOAT_DTYPES)

        # Condition the joint distribution on X and return the expected y.
        return self.gmm_.predict(self.indices_, X)

Code snippet as in #27:

import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_validate

X, y = load_boston(return_X_y=True)

np.random.seed(42)

cross_validate(GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=False), X, y)
>> {'fit_time': array([0.02808738, 0.00212717, 0.00335884, 0.00367188, 0.00159121]),
>>  'score_time': array([0.06866217, 0.06366682, 0.06335163, 0.06228876, 0.06565619]),
>>  'test_score': array([-1.77181999e+03,  7.71024953e-01,  5.81598835e-01,  7.68927960e-02, -1.41784818e+02])}

np.random.seed(42)

cross_validate(GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=True), X, y)
>> {'fit_time': array([0.01825643, 0.0023396 , 0.03778887, 0.012429  , 0.00279951]),
>>  'score_time': array([0.06537795, 0.05936241, 0.05965519, 0.06038404, 0.0568521 ]),
>>  'test_score': array([0.32767343, 0.5339876 , 0.27381149, 0.40139363, 0.38142808])}

def is_pos_def(x):
    return np.all(np.linalg.eigvals(x) >= 0)

np.random.seed(42)

gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=False)
gmr.fit(X, y)

print([is_pos_def(sigma) for sigma in gmr.gmm_.covariances])
>> [False, True]

np.random.seed(42)

gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=True)
gmr.fit(X, y)

print([is_pos_def(sigma) for sigma in gmr.gmm_.covariances])
>> [True, True]

With oracle_approximating_shrinkage=True, all fitted covariances were positive semi-definite.
Cross-validation on this dataset also produced no np.nan or negative scores.
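
To illustrate why shrinkage helps here, the following is a minimal NumPy-only sketch of oracle approximating shrinkage on unweighted samples. It follows the formula used in the scikit-learn implementation linked in the PR description as I understand it; gmr's version works on weighted samples inside EM and may differ in detail. The function names `oas_shrinkage` and `min_eigenvalue` are made up for this example. The check uses np.linalg.eigvalsh rather than np.linalg.eigvals, because eigvalsh is intended for symmetric matrices and always returns real eigenvalues.

```python
import numpy as np


def oas_shrinkage(X):
    """OAS-shrunk covariance of the rows of X (unweighted, illustration only)."""
    n_samples, n_features = X.shape
    emp_cov = np.cov(X, rowvar=False, bias=True)
    mu = np.trace(emp_cov) / n_features
    alpha = np.mean(emp_cov ** 2)
    num = alpha + mu ** 2
    den = (n_samples + 1.0) * (alpha - mu ** 2 / n_features)
    shrinkage = 1.0 if den == 0 else min(num / den, 1.0)
    # Convex combination of the empirical covariance and mu * identity.
    shrunk_cov = (1.0 - shrinkage) * emp_cov + shrinkage * mu * np.eye(n_features)
    return emp_cov, shrunk_cov, shrinkage


def min_eigenvalue(cov):
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(cov)[0]


rng = np.random.RandomState(0)
# 5 samples in 10 dimensions: the empirical covariance is rank-deficient.
X = rng.randn(5, 10)
emp_cov, shrunk_cov, shrinkage = oas_shrinkage(X)

print(shrinkage)
print(min_eigenvalue(emp_cov))     # about 0, possibly slightly negative from round-off
print(min_eigenvalue(shrunk_cov))  # strictly positive
```

Because the shrinkage target is a positive multiple of the identity, every eigenvalue is pulled towards the mean variance, which moves the smallest eigenvalue away from zero.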

I also noticed that on the folds that converged with oracle_approximating_shrinkage=False, the R2 scores differed ([0.771024953, 0.581598835, 0.0768927960] for oracle_approximating_shrinkage=False versus [0.5339876, 0.27381149, 0.40139363] for oracle_approximating_shrinkage=True). Maybe this is expected, given that the algorithm is stochastic?

The same happens in the following snippet.

np.set_printoptions(precision=2)

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=True)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.62 0.62 0.64 0.64 0.64 0.64 0.59 0.64 0.62 0.64]

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=False)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [  0.74   0.74   0.75  -3.98   0.85 -14.96   0.69   0.74   0.83   0.74]

It seems that, when it converges, oracle_approximating_shrinkage=False might lead to better R2 scores.
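
As a quick sanity check of that impression, the two printed score arrays can be summarized directly. This sketch assumes that a run with a negative R2 counts as a failed fit, which is my reading and not a criterion from gmr; the helper name `summarize` is made up for this example.

```python
import numpy as np

# R2 scores copied from the two printed arrays above.
scores_oas = np.array([0.62, 0.62, 0.64, 0.64, 0.64, 0.64, 0.59, 0.64, 0.62, 0.64])
scores_no_oas = np.array([0.74, 0.74, 0.75, -3.98, 0.85, -14.96, 0.69, 0.74, 0.83, 0.74])


def summarize(scores):
    # Assumption: a run with a negative R2 is counted as a failed fit.
    ok = scores[scores > 0]
    return len(scores) - len(ok), ok.mean(), ok.std()


for name, scores in [("OAS", scores_oas), ("no OAS", scores_no_oas)]:
    n_failed, mean, std = summarize(scores)
    print(f"{name}: failed={n_failed}, mean R2 of remaining runs={mean:.3f}, std={std:.3f}")
```

With these numbers, shrinkage gives no failed runs and a mean of about 0.63, while the runs without shrinkage fail twice out of ten and average about 0.76 on the rest.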

AlexanderFabisch · 2021-04-14T07:50:51Z

Thank you for your analysis.

> I also noticed that on the folds that converged with oracle_approximating_shrinkage=False, the R2 scores differed ([0.771024953, 0.581598835, 0.0768927960] for oracle_approximating_shrinkage=False versus [0.5339876, 0.27381149, 0.40139363] for oracle_approximating_shrinkage=True). Maybe this is expected, given that the algorithm is stochastic?

Yes, of course the initialization of MVNs is stochastic, but even if it wasn't I would expect similar results. I think shrinkage of the covariance has an effect similar to regularization in linear regression (not the same, although that would be interesting to investigate...).
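
The regularization analogy can be checked numerically in the simplest case. The sketch below is my own illustration, not something from gmr: it assumes a single Gaussian component, an unweighted empirical covariance, and a hand-picked shrinkage intensity `s`. Under those assumptions, the regression coefficients of the conditional mean obtained from the shrunk covariance coincide with ridge regression on centered data with penalty n * s * mu / (1 - s). For a mixture this only holds per component, and the component responsibilities change as well, so the correspondence is not expected to be exact.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_inputs = 200, 3
X = rng.randn(n_samples, n_inputs)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(n_samples)

# Joint empirical covariance of (x, y), as a single-component model would use.
Z = np.column_stack((X, y))
cov = np.cov(Z, rowvar=False, bias=True)

# Shrink towards mu * I with a hand-picked intensity s (assumption: OAS would
# estimate s from the data instead).
s = 0.1
mu = np.trace(cov) / cov.shape[0]
shrunk = (1.0 - s) * cov + s * mu * np.eye(cov.shape[0])

# Regression coefficients of the conditional mean E[y | x].
cov_xx = shrunk[:n_inputs, :n_inputs]
cov_xy = shrunk[:n_inputs, n_inputs]
coef_shrunk = np.linalg.solve(cov_xx, cov_xy)

# Ridge regression on centered data with lambda = n * s * mu / (1 - s).
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam = n_samples * s * mu / (1.0 - s)
coef_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_inputs), Xc.T @ yc)

print(np.allclose(coef_shrunk, coef_ridge))  # True
```

The factor (1 - s) on the off-diagonal block cancels against the same factor on the input block, which is what leaves a pure ridge penalty on the diagonal.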

mralbu · 2021-04-14T12:48:00Z

I ran some additional comparisons using a regressor that is fitted with the sklearn.mixture.GaussianMixture machinery.

from sklego.mixture import GMMRegressor

np.set_printoptions(precision=4)

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478]

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params='random', max_iter=100)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.8157 0.8061 0.8221 0.8152 0.8221 0.8192 0.8479 0.8282 0.8251 0.7792]

Using the internal sklearn.mixture machinery might help ease the numerical issues, though it would introduce sklearn as a hard dependency and might be out of scope. On the other hand, it would make it easy to introduce other regressors such as a BayesianGMMRegressor, and the parameters would be familiar (the same as in sklearn.mixture.GaussianMixture). Do you think exploring the use of sklearn.mixture inner workings would be interesting for gmr?

AlexanderFabisch · 2021-04-14T12:59:01Z

> Do you think exploring the use of sklearn.mixture inner workings would be interesting for gmr?

Yes, definitely.

I wouldn't like to introduce sklearn as a required dependency though. You can easily initialize a GMM object from the priors, means, and covariances of a sklearn class, which should make the implementation of BayesianGMMRegressor easy. There are actually some examples that use sklearn's Bayesian GMM.

AlexanderFabisch · 2021-04-14T22:13:21Z

I opened a new issue here: #30

AlexanderFabisch added 4 commits April 3, 2021 15:02

Fixes numerical issue in GMM.condition

b2b9a54

Remove unused line

3ec99a4

Add unit test for numerical issue

9ae6f36

Apply oracle approximating shrinkage

8d18ffa

AlexanderFabisch mentioned this pull request Apr 3, 2021

Scikit-learn RegressionMixin #27

Closed

AlexanderFabisch added 4 commits April 4, 2021 23:48

Document OAS

f8918f8

Test for OAS in training and use n_samples_eff

71317f1

Document prior computation in GMM.condition

68f7e84

Version 1.6.dev

9170774

AlexanderFabisch changed the title ~~[WIP] Fixes numerical issue in GMM.condition~~ [MRG] Fixes numerical issue in GMM.condition Apr 4, 2021

AlexanderFabisch mentioned this pull request Apr 13, 2021

Add GMMRegression scikit-learn RegressorMixin #29

Closed

AlexanderFabisch merged commit 9170774 into master Apr 14, 2021

AlexanderFabisch deleted the fix/condition_numerical_issue branch April 14, 2021 22:04

AlexanderFabisch mentioned this pull request Apr 14, 2021

Numerical problems lead to worse scores in comparison to sklearn #30

Closed
