Sl/numba tauchen #227

merged 2 commits into from Feb 2, 2016



6 participants

sglyon commented Jan 31, 2016

I needed a faster version of tauchen, so I dropped the loops into numba.

Here are the before timings (on the actual size problem I'm working on in the application):

In [21]: timeit tauchen(0.95, 0.1, 3, 451)
1 loops, best of 3: 32.4 s per loop

After timings:

In [4]: timeit qe.tauchen(0.95, 0.1, 3, 451)
10 loops, best of 3: 65.4 ms per loop

Not quite as quick as the Julia version, but much closer (Julia timings below).

julia> @benchmark tauchen(451, 0.95, 0.1, 0.0, 3)
================ Benchmark Results ========================
     Time per evaluation: 28.28 ms [27.11 ms, 29.45 ms]
Proportion of time in GC: 0.51% [0.00%, 1.71%]
        Memory allocated: 1.55 mb
   Number of allocations: 3 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 3.04 s
oyamad commented Jan 31, 2016

Speedup very impressive.

I have a not-so-important comment. In the loop, x[j] - rho * x[i] +/- half_step is computed many times, but isn't it enough to compute once and increment by step each time?

Something like:

def _fill_tauchen(x, P, n, rho, sigma, step):
    for i in range(n):
        w = x[0] - rho * x[i] + step / 2
        P[i, 0] = std_norm_cdf(w / sigma)

        for j in range(1, n - 1):
            P[i, j] = -std_norm_cdf(w / sigma)
            w += step
            P[i, j] += std_norm_cdf(w / sigma)

        P[i, n-1] = 1 - std_norm_cdf(w / sigma)


or equivalently, hoisting the division by sigma out of the inner CDF calls:

def _fill_tauchen(x, P, n, rho, sigma, step):
    step_sigma = step / sigma
    for i in range(n):
        w = (x[0] - rho * x[i] + step / 2) / sigma
        P[i, 0] = std_norm_cdf(w)

        for j in range(1, n - 1):
            P[i, j] = -std_norm_cdf(w)
            w += step_sigma
            P[i, j] += std_norm_cdf(w)

        P[i, n-1] = 1 - std_norm_cdf(w)

(Speedup may be negligible though.)
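A quick sanity check (my own sketch, not part of the PR) that the incremental `w += step` update produces exactly the same transition matrix as evaluating `x[j] - rho * x[i] +/- half_step` directly at every `(i, j)` — the two agree because `x[j] - step/2 == x[j-1] + step/2` on an equally spaced grid:

```python
import numpy as np
from math import erfc, sqrt

def std_norm_cdf(x):
    # Standard normal CDF via the complementary error function
    return 0.5 * erfc(-x / sqrt(2.0))

def fill_direct(x, P, n, rho, sigma, step):
    # Evaluate both interval edges from scratch at each (i, j)
    half = step / 2
    for i in range(n):
        P[i, 0] = std_norm_cdf((x[0] - rho * x[i] + half) / sigma)
        for j in range(1, n - 1):
            z = x[j] - rho * x[i]
            P[i, j] = (std_norm_cdf((z + half) / sigma)
                       - std_norm_cdf((z - half) / sigma))
        P[i, n - 1] = 1 - std_norm_cdf((x[n - 1] - rho * x[i] - half) / sigma)

def fill_incremental(x, P, n, rho, sigma, step):
    # oyamad's suggestion: compute w once per row, then increment by step
    for i in range(n):
        w = x[0] - rho * x[i] + step / 2
        P[i, 0] = std_norm_cdf(w / sigma)
        for j in range(1, n - 1):
            P[i, j] = -std_norm_cdf(w / sigma)
            w += step
            P[i, j] += std_norm_cdf(w / sigma)
        P[i, n - 1] = 1 - std_norm_cdf(w / sigma)

n, rho, sigma, m = 9, 0.95, 0.1, 3
sd_y = sigma / np.sqrt(1 - rho**2)
x, step = np.linspace(-m * sd_y, m * sd_y, n, retstep=True)
P1, P2 = np.empty((n, n)), np.empty((n, n))
fill_direct(x, P1, n, rho, sigma, step)
fill_incremental(x, P2, n, rho, sigma, step)
assert np.allclose(P1, P2)
assert np.allclose(P1.sum(axis=1), 1.0)  # rows are proper probabilities
```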

albop commented Jan 31, 2016

The performance gains seem indeed quite good. There is probably no point in looking for further improvements, since it is unlikely that tauchen will be called repeatedly (you can always scale the normal).

I still did a bit of digging. Most of the performance cost seems to come from the erfc function (I replaced it with the identity to check). Maybe the difference between the implementations of this routine explains the gap: in numba the implementation of erfc is the one from CPython's C source, while the one used in Julia is apparently the same as in glibc.
LLVM 3.7, to which numba is currently switching, has incorporated erfc, so the situation may be easy to improve on the numba side.

cc7768 commented Feb 1, 2016

I agree with @albop that it is probably futile to look for more performance gains, but fwiw this was faster on my computer (but only marginally and we lose the benefit of the function being jitted).

import numpy as np
import scipy.stats as st

def tauchen(n, m, ybar, sigma, rho=0.0):
    """
    Takes as inputs n, m, ybar, sigma and produces a Markov chain
    approximation to the process:

        y_t = \bar{y} + \rho y_{t-1} + \varepsilon_t

    where \varepsilon_t is i.i.d. normal with mean 0 and standard
    deviation sigma.

    Parameters
    ----------
    n : int
        The number of points used to approximate the distribution

    m : int
        The number of standard deviations to go away from the mean
        in each direction

    ybar : float
        The value \bar{y} in the process.  Note that the mean of this
        AR(1) process, y, is simply ybar/(1 - rho)

    sigma : float
        The standard deviation of the \varepsilon process

    rho : float
        By default this will be 0, but if you are approximating an AR(1)
        process then this is the autocorrelation across periods

    Returns
    -------
    bar : 1d array of floats
        The possible states for the Markov chain

    pi : 2d array of floats
        The transition probabilities across states
    """
    # Get the standard deviation of y
    y_sd = np.sqrt(sigma**2/(1 - rho**2))

    # Go out m standard deviations and create equally spaced points
    # from -m*y_sd to m*y_sd; note d is the distance between points
    ubar = m*y_sd
    lbar = -ubar
    bar, d = np.linspace(lbar, ubar, n, retstep=True)

    # Create matrix of values
    barmat = np.repeat(bar, n).reshape(n, n)

    barup = (barmat.T + d/2. - rho*barmat)/sigma
    bardo = (barmat.T - d/2. - rho*barmat)/sigma

    pi = st.norm.cdf(barup) - st.norm.cdf(bardo)
    pi[:, 0] = st.norm.cdf(barup[:, 0])
    pi[:, -1] = 1. - st.norm.cdf(bardo[:, -1])

    bar += ybar/(1. - rho)

    return bar, pi
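For reference, here is a self-contained check of the vectorized construction above (my own sketch, not part of the PR; it mirrors the barup/bardo logic and assumes scipy is available) confirming it yields a proper stochastic matrix:

```python
import numpy as np
import scipy.stats as st

def tauchen_vec(n, m, ybar, sigma, rho=0.0):
    # Same construction as cc7768's snippet, condensed
    y_sd = np.sqrt(sigma**2 / (1 - rho**2))
    bar, d = np.linspace(-m * y_sd, m * y_sd, n, retstep=True)
    barmat = np.repeat(bar, n).reshape(n, n)      # barmat[i, j] = bar[i]
    barup = (barmat.T + d / 2 - rho * barmat) / sigma
    bardo = (barmat.T - d / 2 - rho * barmat) / sigma
    pi = st.norm.cdf(barup) - st.norm.cdf(bardo)
    pi[:, 0] = st.norm.cdf(barup[:, 0])           # open lower tail
    pi[:, -1] = 1. - st.norm.cdf(bardo[:, -1])    # open upper tail
    return bar + ybar / (1. - rho), pi

states, pi = tauchen_vec(9, 3, 0.0, 0.1, rho=0.95)
assert pi.shape == (9, 9)
assert (pi >= 0).all()
assert np.allclose(pi.sum(axis=1), 1.0)  # each row sums to one
```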
sglyon commented Feb 1, 2016

The erfc function must be quite a bit slower in Python than whatever Julia is using, because the algorithm is exactly the same in both languages: fill in one row at a time. That order isn't the natural one for Julia's column-major arrays, so the Julia version is likely taking a performance hit by not "transposing" the algorithm.

Anyway, I would be happy merging either this or Chase's function. I'm indifferent between the two.

Also agree with @albop that searching for milliseconds here probably doesn't need to be a priority. I just didn't like waiting 35+ seconds to initialize a model when I knew it could take under a second ;)

cc7768 commented Feb 1, 2016

I have a slight preference for the numbaized function. Cleaner, and the difference is pretty negligible (and it sounds like it will get better for free).

albop commented Feb 1, 2016

Yes, @cc7768, it will get better over time.
Just for fun: one can use the libm function directly using:

from cffi import FFI

ffi = FFI()
ffi.cdef('double erfc(double x);')  # declare the libm signature
libm = ffi.dlopen("m")              # load the C math library
erfc = libm.erfc

On my machine, this test produces the following timing:

cpython: 2.5553548336029053
libm: 0.09564900398254395

It is a bit more than 20 times faster than the cpython version !
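The same trick works with only the standard library (my own variant, not from the PR, using ctypes instead of cffi); a quick check that the libm routine agrees with Python's math.erfc:

```python
import ctypes
import ctypes.util
import math

# Load the C math library; fall back to the main program's symbols
# (which link libm on most Unix systems) if find_library fails.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.erfc.restype = ctypes.c_double
libm.erfc.argtypes = [ctypes.c_double]

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(libm.erfc(x) - math.erfc(x)) < 1e-12
```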

albop commented Feb 1, 2016

With the libm version of erfc the numbaized version goes from 88ms to 25ms on my computer.

Just to be clear: my vote goes to keep the numbaized version exactly as it is and not bother with including more optimization.
This is essentially a numba "bug". There has been an ongoing discussion about the relative performance of math. and numpy. functions in numba/numba#1616. We should just make sure that they are aware of potential differences with libm.

sglyon commented Feb 1, 2016

Thanks for checking out those benchmarks.

I think that as long as @jstac is ok with this, we can merge as is

jstac commented Feb 1, 2016

Thanks all for the discussion. I've been following and I've learned from reading it. It makes sense to Numba-fy the routine and I agree that we don't need to search for the last drop of speed here.

As a general comment for this PR and others, you guys are in charge of this library and QE.jl, so please feel free to merge without my input. They are really your libraries now, and you are better equipped to take the lead than I am.

If it's a change that affects the lectures, I'd appreciate if you let me know (although it's not actually your responsibility, and in any case I follow the discussions in the PRs). It will be the job of Tom and I to keep the lectures up to date with changes that you guys make to the libraries.

For QuantEcon.applications, Tom and I still need to take charge of changes there, since that repo just serves the lectures at this stage.

@sglyon sglyon merged commit 9ff954c into master Feb 2, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
sglyon commented Feb 2, 2016

Sounds great, thanks @jstac!

mmcky commented Feb 2, 2016

@spencerlyon2 thanks for this update. Do you mind if we delete the branch?

sglyon commented Feb 2, 2016

Not at all. Let's keep things tidy


@mmcky mmcky deleted the sl/numba-tauchen branch Feb 2, 2016