
Conversation

Contributor

@sdewaele sdewaele commented Nov 29, 2020

Addresses #619.

  • Make transformations work for CuArray vectors. They currently only work for CuArray matrices.
  • Find an appropriate home for methods currently defined in "test/cuda.jl" during testing.

This code passes the tests and is ready for merge from my perspective.
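For reference, the intended usage on the CPU path (a minimal sketch; the CuArray case is analogous but requires CUDA.jl and a GPU):

```julia
using StatsBase

x = randn(100)                     # a plain vector; previously only matrices worked
t = fit(ZScoreTransform, x)        # fit the transform directly to the vector
xt = StatsBase.transform(t, x)     # standardized copy, still a Vector
x2 = StatsBase.reconstruct(t, xt)  # round-trips back to (approximately) x
```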

src/StatsBase.jl Outdated
function __init__()
@require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
using .CUDA
function _compute_extrema(X::Union{CuMatrix{<:Real},Adjoint{T,CuMatrix{T}} where T<:Real})
Contributor

Do you use this function somewhere? If not, the whole __init__ block can be removed.

Contributor Author

It is used here in the fit(::Type{UnitRangeTransform}, ...) function. The existing implementation does not work (or is very slow) for CuArrays because the array is indexed element-wise.

Member

Sorry, but I don't think using Requires is OK for a package like StatsBase, as it will increase load times of many packages. We should find better abstractions so that StatsBase doesn't have to be aware of CUDA.

Couldn't _compute_extrema be replaced with extrema.(eachcol(X))? Would that be enough to use optimized CUDA methods?
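For context, what the suggested broadcast computes (a small CPU sketch; whether CUDA provides optimized methods for it is the open question):

```julia
X = [1.0 4.0;
     2.0 6.0;
     3.0 5.0]

# One (min, max) tuple per column, without indexing into X element-wise:
colext = extrema.(eachcol(X))   # [(1.0, 3.0), (4.0, 6.0)]
tmin = first.(colext)           # [1.0, 4.0]
tmax = last.(colext)            # [3.0, 6.0]
```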

Contributor Author

I was not too happy with using Requires either. I have removed it now. I also replaced _compute_extrema by extrema. The caveat is that extrema has not yet been implemented for CUDA arrays, but it is reasonable to expect that it will become available in the future; it is mentioned in JuliaGPU/CUDA.jl#265 (comment)

m, s = mean_and_std(X)
return ZScoreTransform(1, dims, (center ? [m] : zeros(T, 0)),
                       (scale ? [s] : zeros(T, 0)))
return fit(ZScoreTransform, reshape(X, :, 1); dims=dims, center=center, scale=scale)
Contributor

The vector structure of the input will be lost, and transform will return a 1-column matrix instead of a vector.

Contributor Author

Thanks for your review. I will fix this shortly. The issue is that when I stayed closer to the original code, I got an inference error. I'll look for an alternate solution to work around that.

Contributor Author

It looks like transform still works as expected, i.e. it returns a Vector.

julia> X = rand(3)
3-element Vector{Float64}:
 0.0220367
 0.019156
 0.572394

julia> t = fit(ZScoreTransform, X)
ZScoreTransform{Float64, Vector{Float64}}(1, 1, [0.204529], [0.318584])

julia> transform(t, X)
3-element Vector{Float64}:
 -0.572823
 -0.581866
  1.15469

julia> Xt = transform(t, X)
3-element Vector{Float64}:
 -0.572823
 -0.581866
  1.15469

julia> reconstruct(t, Xt)
3-element Vector{Float64}:
 0.0220367
 0.019156
 0.572394

This is achieved through the transform method that is specialized for vectors:
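The method itself is not quoted above; the usual shape of such a specialization (a sketch, not necessarily the exact code in this PR) is to reuse the matrix method and restore the vector shape:

```julia
# Sketch: reshape the vector to a one-column matrix, apply the matrix
# method, then flatten back to a Vector. The PR's actual method may differ.
transform(t::AbstractDataTransform, x::AbstractVector{<:Real}) =
    vec(transform(t, reshape(x, :, 1)))
```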

Contributor

@wildart wildart left a comment

Minor comment on allocation, otherwise looks good.

end
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
tmin_tmax = extrema(X; dims=dims)
tmin, tmax = [vec(getindex.(tmin_tmax, i)) for i in 1:2]
Contributor

You can form a generator instead of allocating an array.

tmin, tmax = (vec(getindex.(tmin_tmax, i)) for i in 1:2)
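A generator supports the same destructuring as the comprehension while skipping the intermediate two-element array (small sketch):

```julia
X = [1.0 4.0;
     2.0 6.0;
     3.0 5.0]
tmin_tmax = extrema(X; dims=1)  # 1×2 Matrix of (min, max) tuples

# Comprehension: allocates a temporary Vector holding the two results.
tmin_a, tmax_a = [vec(getindex.(tmin_tmax, i)) for i in 1:2]

# Generator: destructures the same way with no outer allocation.
tmin_b, tmax_b = (vec(getindex.(tmin_tmax, i)) for i in 1:2)
```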

Contributor Author

Done. Thanks for the suggestion.

for i = 1:l
@inbounds tmax[i] = 1 / (tmax[i] - tmin[i])
end
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
Member

Suggested change
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
dims ∈ (1, 2) || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))

m, s
end


Member

No need to change this.

Contributor Author

I restored the additional line.

end
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
tmin_tmax = extrema(X; dims=dims)
tmin, tmax = (vec(getindex.(tmin_tmax, i)) for i in 1:2)
Member

Ah, it's too bad that this forces copying the result to extract the min and max. Do you think CUDA supports eachslice, or would it be efficient to iterate over rows/columns and call extrema on them? Then we could store the result directly in the right format.

Contributor Author

@sdewaele sdewaele Dec 6, 2020

I have been experimenting a little bit, but have not found a good solution. This prevents the copying, but does not work with CuArrays because of the element-wise setindex:

l = size(X, dims==1 ? 2 : 1)
tmin, tmax = (similar(X, l) for _ in 1:2)
for (i, x) in enumerate(eachslice(X; dims=dims))
    tmin[i], tmax[i] = extrema(x) # per-element assignment does not work with CUDA
end

Then there is zip plus splatting (...), but I think the performance of that is even worse because of the additional splatting penalty:

tmin, tmax = collect.(zip((extrema(x) for x in eachslice(X; dims=dims))...))

Also, the outputs now no longer end up in a CuArray, which means that transform! will fail on a CuArray.

An option is to not pursue support for UnitRangeTransform in this PR. Anyway, extrema has not yet been implemented in CUDA.jl. I can reintroduce _compute_extrema which can then later be implemented for CuArrays.

Member

Ah, too bad. Maybe it's better not to implement this until CUDA supports it. Could you ask for advice there about the best way to achieve this without copying?

Contributor Author

I asked the question on Discourse, see here.

Member

Might also be worth posting a link on #gpu on Slack.

Contributor Author

@sdewaele sdewaele Dec 11, 2020

In the latest commit I added a new _compute_extrema that eliminates the allocation. With that, it should now be as efficient as the original implementation for regular arrays. This also allows for a temporary solution for CuArrays by implementing a specialized _compute_extrema for CuArrays:

function StatsBase._compute_extrema(X::CuArray; dims)
    tmin = vec(minimum(X; dims=dims))
    tmax = vec(maximum(X; dims=dims))
    return tmin, tmax
end

This is not optimal, but it works. Also, given the compute power of a GPU, I assume it will still give a significant performance improvement.
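The generic method is not quoted in this thread; a sketch of an allocation-free CPU fallback along the lines described (the name and exact signature are assumptions based on the discussion; eachslice needs Julia ≥ 1.1, or Compat ≥ 2.2 on 1.0):

```julia
# Sketch: per-variable extrema without an intermediate array of tuples.
# With dims=1, observations are rows, so extrema are taken over each column.
function _compute_extrema(X::AbstractMatrix{<:Real}; dims::Integer)
    slicedim = dims == 1 ? 2 : 1   # iterate over the other dimension
    l = size(X, slicedim)
    tmin = similar(X, l)
    tmax = similar(X, l)
    for (i, x) in enumerate(eachslice(X; dims=slicedim))
        tmin[i], tmax[i] = extrema(x)  # scalar writes: fine on CPU arrays
    end
    return tmin, tmax
end
```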

Member

Cool. Though you'll need Compat 2.2 to use eachslice on Julia 1.0.

Contributor Author

sdewaele commented Dec 17, 2020

I suggest to merge the PR now. It will have the following benefits:

  1. ZScoreTransform has been made more generic and works for CuArray
  2. UnitRangeTransform has been made more generic in preparation for using it for CuArray.

In my opinion, ZScoreTransform is the more important of the two anyway; it is probably the more frequently used normalization in machine learning.

After this PR is merged, users can define _compute_extrema for CuArrays as described above to make UnitRangeTransform work. This is not possible with the current definition of UnitRangeTransform because it is not generic enough.

P.S. The CI states that "Some checks were not successful". I don't think this has anything to do with this PR; it has failed on downloading the registry. Probably re-running the test will eliminate the error.

Member

@nalimilan nalimilan left a comment

Fine with me. Just a few stylistic comments.

sdewaele and others added 3 commits December 19, 2020 11:24
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan nalimilan merged commit 83ccc37 into JuliaStats:master Dec 19, 2020
Member

Thanks!

Contributor Author

You are welcome. Thanks for your thorough review @nalimilan and @wildart.
