
WIP : Remove smirnov entries from special/_legacy.pxd #9089

Closed

Conversation

pvanmulbregt
Contributor

(WIP/Example)
@person142 has noted in #8737 (review) that adding unsafe versions of functions to special/_legacy.pxd is unwanted.
This work/example removes the smirnov and smirnovi entries to understand the effect.

After removing smirnov entries from _legacy.pxd and functions.json, two tests fail with TypeError Exceptions.

Test Failure 1:
scipy/stats/tests/test_continuous_basic.py: test_rvs_broadcast fails because an array constructed from the distribution's shape values is not given the dtype of those values. test_rvs_broadcast is a generic routine used to test 100+ distributions; it gets the shape parameters from scipy/stats/_distr_params.py.
I.e. an array of float is created even though the first element of the shape may be an int. Attempting to pass that float into smirnov(int n, double e) then fails.

This test can be made to pass. The change to stats/tests/test_continuous_basic.py in this PR uses the type of the shape_args element when creating the array of ones. I.e.
allargs.append(shape_args[k]*np.ones(shp))
->
allargs.append(shape_args[k]*np.ones(shp, dtype=type(shape_args[k])))
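The effect of the dtype argument can be seen with a minimal pure-NumPy snippet (variable names here are illustrative, not the test's actual ones):

```python
import numpy as np

shape_arg = 3          # an int-typed shape parameter, e.g. smirnov's n
shp = (2, 2)

# Without an explicit dtype, np.ones defaults to float64,
# so the product is promoted to float even for an int shape_arg.
promoted = shape_arg * np.ones(shp)

# Passing the element's own type keeps the integer dtype intact.
preserved = shape_arg * np.ones(shp, dtype=type(shape_arg))

print(promoted.dtype, preserved.dtype)  # float64 vs. the platform's default int dtype
```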

This would affect the invocation of test_rvs_broadcast for all 100+ distributions. None of the distributions currently listed in _distr_params.py:distcont fail. Any reason to think that this change would not be safe to apply, independent of gh-8737?

Modify a generic test framework to respect type of distribution shape
arguments. Allowed removal of smirnov entries from _legacy.pxd and
functions.json in special.
Still fails one test test_ksone_fit_freeze in
scipy/stats/tests/test_distributions.py.
@pvanmulbregt
Contributor Author

pvanmulbregt commented Jul 29, 2018

Test Failure 2:
scipy/stats/tests/test_distributions.py: test_ksone_fit_freeze fails.
It appears that @stefanv added this test in 2012
f7083b6 with title "TST: stats: Add regression test for ticket 1638." (Aside: Does "ticket 1638" refer to
gh-1638="optimize.curve_fit does not handle None cov_x from optimize.leastsq properly (Trac #1111)"
or "cephes_smirnov nan-to-int conversion bug (Trac #1638)"?)

It fails trying to construct a moment because the first arg ('n') passed in has a float type, not an int type. (See below.)

What exactly is this function test_ksone_fit_freeze trying to test?
Is it valid to call stats.<DIST>.fit(data) if the distribution has one or more integer (i.e. not float) shape arguments?

scipy/stats/tests/test_distributions.py:2882 (test_ksone_fit_freeze)
def test_ksone_fit_freeze():
        # Regression test for ticket #1638.
        d = np.array(
            [-0.18879233, 0.15734249, 0.18695107, 0.27908787, -0.248649,
             -0.2171497, 0.12233512, 0.15126419, 0.03119282, 0.4365294,
              0.08930393, -0.23509903, 0.28231224, -0.09974875, -0.25196048,
              0.11102028, 0.1427649, 0.10176452, 0.18754054, 0.25826724,
              0.05988819, 0.0531668, 0.21906056, 0.32106729, 0.2117662,
              0.10886442, 0.09375789, 0.24583286, -0.22968366, -0.07842391,
             -0.31195432, -0.21271196, 0.1114243, -0.13293002, 0.01331725,
             -0.04330977, -0.09485776, -0.28434547, 0.22245721, -0.18518199,
             -0.10943985, -0.35243174, 0.06897665, -0.03553363, -0.0701746,
             -0.06037974, 0.37670779, -0.21684405])
    
        try:
            olderr = np.seterr(invalid='ignore')
            with suppress_warnings() as sup:
                sup.filter(IntegrationWarning,
                           "The maximum number of subdivisions .50. has been "
                           "achieved.")
                sup.filter(RuntimeWarning,
                           "floating point number truncated to an integer")
>               stats.ksone.fit(d)

d          = array([-0.18879233,  0.15734249,  0.18695107,  0.27908787, -0.248649  ,
       -0.2171497 ,  0.12233512,  0.15126419, ...
       -0.10943985, -0.35243174,  0.06897665, -0.03553363, -0.0701746 ,
       -0.06037974,  0.37670779, -0.21684405])
olderr     = {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
sup        = <numpy.testing.nose_tools.utils.suppress_warnings object at 0x10a20ef60>

test_distributions.py:2905: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../_distn_infrastructure.py:2188: in fit
    start = self._fitstart(data)
../_distn_infrastructure.py:2027: in _fitstart
    loc, scale = self._fit_loc_scale_support(data, *args)
../_distn_infrastructure.py:2240: in _fit_loc_scale_support
    loc_hat, scale_hat = self.fit_loc_scale(data, *args)
../_distn_infrastructure.py:2302: in fit_loc_scale
    mu, mu2 = self.stats(*args, **{'moments': 'mv'})
../_distn_infrastructure.py:1025: in stats
    mu = self._munp(1, *goodargs)
../_distn_infrastructure.py:783: in _munp
    vals = self.generic_moment(n, *args)
/Users/Paul/anaconda/envs/py36scipydev/lib/python3.6/site-packages/numpy/lib/function_base.py:2755: in __call__
    return self._vectorize_call(func=func, args=vargs)
/Users/Paul/anaconda/envs/py36scipydev/lib/python3.6/site-packages/numpy/lib/function_base.py:2831: in _vectorize_call
    outputs = ufunc(*inputs)
../_distn_infrastructure.py:1609: in _mom1_sc
    return integrate.quad(self._mom_integ1, 0, 1, args=(m,)+args)[0]
../../integrate/quadpack.py:341: in quad
    points)
../../integrate/quadpack.py:448: in _quad
    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
../_distn_infrastructure.py:1606: in _mom_integ1
    return (self.ppf(q, *args))**m
../_distn_infrastructure.py:1919: in ppf
    place(output, cond, self._ppf(*goodargs) * scale + loc)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <scipy.stats._continuous_distns.ksone_gen object at 0x109d8f828>
q = array([0.5]), n = array([1.])

    def _ppf(self, q, n):
>       return sc.smirnovi(n, 1.0 - q)
E       TypeError: ufunc 'smirnovi' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

n          = array([1.])
q          = array([0.5])
self       = <scipy.stats._continuous_distns.ksone_gen object at 0x109d8f828>

../_continuous_distns.py:48: TypeError
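The TypeError above is NumPy's 'safe' casting rule at work: once the smirnov ufunc only provides an integer loop for n, a float64 array can no longer be coerced to it. A minimal check, independent of scipy:

```python
import numpy as np

# float64 -> int64 could lose information, so 'safe' casting refuses it;
# this is why a ufunc with an integer-only loop rejects float input.
print(np.can_cast(np.float64, np.int64, casting='safe'))   # False

# The reverse direction, int -> float, is considered safe and allowed.
print(np.can_cast(np.int64, np.float64, casting='safe'))   # True
```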

@pv
Member

pv commented Jul 29, 2018

fit() for distributions does not appear to handle integer parameters; for example, it returns non-integer values for them.
Addressing this issue probably requires deciding what to do about that. If the current behavior is what is desired, the distribution could override its _fitstart and _nnlf_and_penalty to implement the integer truncation explicitly.

With the "legacy" behavior of flooring integer parameters, you probably get some semi-reasonable result from the fit --- the default optimizer is nelder-mead which probably is not terribly misled.

@josef-pkt
Member

#2163 #187

Two reasons why ints are not nice, in general and not specific to smirnov:

  • There is no NaN for an int dtype. In the old days NaN was converted to zero when cast to int. Since we need NaN handling in the stats distributions, the default is to use floats even if the parameters are integer valued.
  • Some functions, specifically fit and the bisection methods in optimize, only work with floats. To use those for integer-valued functions, it is convenient not to have a dtype check that breaks them, i.e. raises an exception.
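The first point is easy to demonstrate with pure NumPy: the moment a nan enters, the array must be float, because no integer dtype can represent it.

```python
import numpy as np

# Mixing integer parameter values with nan forces a float64 array;
# there is simply no nan in any integer dtype.
params = np.array([3, 5, np.nan])
print(params.dtype)       # float64
print(np.isnan(params))   # False for the ints, True for the nan slot

# An all-int input stays integer, with no way to hold a missing value.
clean = np.array([3, 5, 7])
print(clean.dtype)        # platform default int dtype
```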

Also, it is not clear which functions have an extension to real numbers, e.g. the degrees of freedom in the F, t and chi2 distributions don't need to be int. Many discrete distribution parameters extend to floats.
My impression from some experiments (which I don't remember) was that some special functions extended to the real line, even if the non-integer values were just a monotonic extension that might not have a substantive interpretation.

@pvanmulbregt
Contributor Author

[There are actually more than two tests failing in the CI runs. The failures internal to special are not so interesting for this example, as they can be dealt with as in gh-8737 by a modification to FuncData(...).check() to pass in the types. It is the failures in stats, the failures external to special, that are of more interest.]

For smirnov(n, e), it doesn't currently accept n=np.nan or n=np.inf; finite floats are truncated to integers.
There may be an extension of smirnov(n, e) from n \in Z to n \in R+, though the natural definition and algorithm are for positive integer n. The asymptotic formulae might be computable for positive real n, but I'm sure that is not what would be calculated if smirnov(n, e) were passed a real non-integral n.
I looked at gh-2163 and gh-187. They seem more concerned with e=nan, rather than n=nan.

My observation on fit is that it has had unintended consequences elsewhere (E.g. gh-7491.)
Is there a use-case for finding the best-fitting smirnov distribution to some data? Is the data in test_ksone_fit_freeze() real or synthetic? I.e. Do we need to preserve it?

@josef-pkt
Member

I don't have the latest scipy installed, but I get what I would expect. Although it would be better if the return for np.inf and other large n were the asymptotic value, if that is possible.

>>> scipy.__version__
'0.18.1'
>>> special.smirnov(np.array([6.1, 6, 20.1, 20, np.nan, np.inf]), 0.05)
array([ 0.93618592,  0.93618592,  0.87365249,  0.87365249,         nan,
               nan])

I'm mainly in favor of a general policy where the ufuncs in scipy.special handle those cases. This is more convenient and makes it easier to handle many distributions consistently in scipy.stats and in other packages.
The flooring could also be done in the _cdf and other methods of the scipy.stats distributions. But AFAIK there are currently no cases where a specific dtype different from float is required. (Some distributions like norm.cdf cannot handle complex dtype.)
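A hypothetical sketch of the floor-inside-the-method option (the function name and formula are illustrative stand-ins, not scipy code):

```python
import numpy as np

def toy_cdf(n, e):
    """Toy stand-in for a distribution method whose shape parameter n
    is nominally an integer: floor it up front, then proceed with a
    pure float computation. The formula is a placeholder, not smirnov."""
    n = np.floor(np.asarray(n, dtype=np.float64))  # legacy-style truncation
    return 1.0 - (1.0 - e) ** n

# Non-integral n is floored, so 6.9 and 6.0 give identical results,
# while nan propagates instead of raising a TypeError.
print(toy_cdf(6.9, 0.05) == toy_cdf(6.0, 0.05))
print(toy_cdf(np.nan, 0.05))
```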

If I remember correctly (I haven't looked at it in a long time), the asymptotic distribution for the one-sided kstest has a simple form that would not have an int requirement (n=np.inf, IIUC).

Related to extension to positive real line: scipy stats distribution uses gamma (or more often gammaln) instead of factorial so we have the results for the real line. AFAIK, we replaced all (?) use of factorial in scipy.stats.

Is there a use-case for finding the best-fitting smirnov distribution to some data? Is the data in test_ksone_fit_freeze() real or synthetic?

The bug was reported from pymvpa. pymvpa was (and maybe still is) fitting all distributions to find a candidate distribution that is closest to the data. AFAIK, the fit unit tests in scipy.stats.distributions still use a blacklist for distributions where fit is not attempted.

I don't know of a use case where I would fit smirnov, but that doesn't mean it might not show up.
A more obvious use case for using optimize with the kstest-related distributions is sample size and power computation; e.g. I use the univariate rootfinders from scipy.optimize in statsmodels for this. But that doesn't include power or sample size computation for kstest yet.

I hadn't looked at gh-7491 (too close to or at the beginning of my vacation last year). I don't see how that is related. For many distributions boundary values for the parameters are explicitly excluded and return or should return nan in those cases. There have been changes over the years to the behavior at the boundaries, but most likely the code for those cases is still not completely "clean" for distributions that are not used much.

@pvanmulbregt
Contributor Author

If the test data is from real data then it is a real use case. [Some distributions are blacklisted for testing but this particular test is an individual test specifically targeted at ksone/smirnov.] It is a little unusual, in that the return is almost a uniform distribution, but if that's what was requested...

>>> d = np.array(
            [-0.18879233, 0.15734249, 0.18695107, 0.27908787, -0.248649,
             -0.2171497, 0.12233512, 0.15126419, 0.03119282, 0.4365294,
              0.08930393, -0.23509903, 0.28231224, -0.09974875, -0.25196048,
              0.11102028, 0.1427649, 0.10176452, 0.18754054, 0.25826724,
              0.05988819, 0.0531668, 0.21906056, 0.32106729, 0.2117662,
              0.10886442, 0.09375789, 0.24583286, -0.22968366, -0.07842391,
             -0.31195432, -0.21271196, 0.1114243, -0.13293002, 0.01331725,
             -0.04330977, -0.09485776, -0.28434547, 0.22245721, -0.18518199,
             -0.10943985, -0.35243174, 0.06897665, -0.03553363, -0.0701746,
             -0.06037974, 0.37670779, -0.21684405])
>>> scipy.stats.ksone.fit(d)
(1.0000004344985516, 0.0035637742302959121, 0.43329585129437048)

For n=np.inf, the ksone/Smirnov CDF is just the Heaviside step function, because the domain is unscaled differences. For large finite n, the asymptotic formula can be used to simplify the calculation.
For non-integral n, I think there may be a formulation which could be used to extend from Z+ to R+. But it is not used in gh-8737, as that code explicitly raises to an integer power.
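The step-function limit is visible in the classical one-sided asymptotic P(D_n+ >= e) ≈ exp(-2 n e^2): for any fixed e > 0 it vanishes as n grows, and it is 1 at e = 0. A quick stdlib check of that limit (this is the textbook approximation, not scipy's implementation):

```python
import math

def smirnov_asymptotic(n, e):
    """Classical large-n approximation to the one-sided K-S survival
    function P(D_n+ >= e); exact only in the n -> infinity limit."""
    return math.exp(-2.0 * n * e * e)

# For fixed e > 0 the survival probability collapses toward 0 as n grows,
# which is the Heaviside-step CDF mentioned above.
for n in (10, 100, 10_000):
    print(n, smirnov_asymptotic(n, 0.05))
```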

Re gh-7491: If I recall correctly fit maximizes the probability of the data over all possible shape args by minimizing the sum of neg log probs. It uses "PDF(x|shape)" as a proxy for "Prob(observation=x|shape)", and that causes problems if the PDF is 0 or infinite. The _support_mask was added to help deal with this problem, but that broke the PDF for some families of distributions.

@pvanmulbregt
Contributor Author

Closing as removing smirnov entries from _legacy.pxd would break an existing test on real user data.
