more generic ZScoreTransform, UnitRangeTranform to support CuArrays #622
Conversation
…ependency using Requires
src/StatsBase.jl
Outdated
```julia
function __init__()
    @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
        using .CUDA
        function _compute_extrema(X::Union{CuMatrix{<:Real},Adjoint{T,CuMatrix{T}} where T<:Real})
```
Do you use this function somewhere? If not, the whole `__init__` can be removed.
It is used here in the `fit(::Type{UnitRangeTransform}, ...)` function. The existing implementation does not work (or is very slow) for CuArrays because the array is indexed element by element.
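For context, a small sketch of why element-wise indexing hurts on the GPU (this assumes CUDA.jl is installed and is not code from this PR):

```julia
using CUDA

X = CUDA.rand(1000, 1000)
CUDA.allowscalar(false)  # turn the slow scalar-indexing fallback into an error
# X[1, 1]                # would now throw: each scalar access is a host-device round trip
minimum(X; dims=1)       # reductions like this run as GPU kernels and stay fast
```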
Sorry, but I don't think using Requires is OK for a package like StatsBase, as it will increase load times of many packages. We should find better abstractions so that StatsBase doesn't have to be aware of CUDA.
Couldn't `_compute_extrema` be replaced with `extrema.(eachcol(X))`? Would that be enough to use optimized CUDA methods?
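For illustration, a minimal CPU sketch of that suggestion (`eachcol` needs Julia >= 1.1, or Compat on 1.0; the variable names are only for this example):

```julia
X = rand(5, 3)
minmax = extrema.(eachcol(X))  # one (min, max) tuple per column
tmin = first.(minmax)          # per-column minima
tmax = last.(minmax)           # per-column maxima
```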
I was not too happy with using Requires either; I have removed it now. I also replaced `_compute_extrema` by `extrema`. The caveat is that `extrema` has not yet been implemented for CUDA arrays, but it is reasonable to expect that it will become available in the future; it is mentioned in JuliaGPU/CUDA.jl#265 (comment).
src/transformations.jl
Outdated
```diff
-m, s = mean_and_std(X)
-return ZScoreTransform(1, dims, (center ? [m] : zeros(T, 0)),
-                       (scale ? [s] : zeros(T, 0)))
+return fit(ZScoreTransform, reshape(X, :, 1); dims=dims, center=center, scale=scale)
```
The vector structure of the input will be lost, and `transform` will return a 1-column matrix instead of a vector.
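A short illustration of the concern (hypothetical values, not from the PR):

```julia
X = rand(3)            # 3-element Vector
Xm = reshape(X, :, 1)  # 3×1 Matrix backed by the same data
size(Xm)               # (3, 1): the vector shape is gone, so a transform
                       # fit and applied on Xm yields 1-column matrices
```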
Thanks for your review. I will fix this shortly. The issue is that when I stayed closer to the original code, I got an inference error. I'll look for an alternative solution to work around that.
It looks like `transform` still works as expected, i.e. it returns a `Vector`:

```julia
julia> X = rand(3)
3-element Vector{Float64}:
 0.0220367
 0.019156
 0.572394

julia> t = fit(ZScoreTransform, X)
ZScoreTransform{Float64, Vector{Float64}}(1, 1, [0.204529], [0.318584])

julia> transform(t, X)
3-element Vector{Float64}:
 -0.572823
 -0.581866
  1.15469

julia> Xt = transform(t, X)
3-element Vector{Float64}:
 -0.572823
 -0.581866
  1.15469

julia> reconstruct(t, Xt)
3-element Vector{Float64}:
 0.0220367
 0.019156
 0.572394
```

This is achieved through the `transform` method that is specialized for vectors.
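A minimal sketch of how such a vector specialization can work (hypothetical helper name, not StatsBase's actual method):

```julia
using StatsBase: transform

# Reshape the vector to a 1-column matrix, reuse the matrix code path,
# then drop the singleton dimension so callers get a Vector back.
transform_vec(t, x::AbstractVector{<:Real}) = vec(transform(t, reshape(x, :, 1)))
```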
wildart left a comment:
Minor comment on allocation, otherwise looks good.
src/transformations.jl
Outdated
```julia
end
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
tmin_tmax = extrema(X; dims=dims)
tmin, tmax = [vec(getindex.(tmin_tmax, i)) for i in 1:2]
```
You can form a generator instead of allocating an array.
```julia
tmin, tmax = (vec(getindex.(tmin_tmax, i)) for i in 1:2)
```
Done. Thanks for the suggestion.
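For reference, a self-contained version of the accepted pattern (example data only):

```julia
X = rand(4, 3)
tmin_tmax = extrema(X; dims=1)  # 1×3 Matrix of (min, max) tuples
# Destructuring iterates the generator lazily, so no intermediate
# 2-element array is allocated.
tmin, tmax = (vec(getindex.(tmin_tmax, i)) for i in 1:2)
```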
src/transformations.jl
Outdated
```julia
for i = 1:l
    @inbounds tmax[i] = 1 / (tmax[i] - tmin[i])
end
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
```
Suggested change:

```diff
-dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
+dims ∈ (1, 2) || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
```
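A plausible rationale for the suggested change, shown as a sketch (the error string is copied from the diff above):

```julia
# (1, 2) is an immutable tuple, so the membership test allocates nothing,
# while [1, 2] builds a fresh Vector on every call.
dims = 2
dims ∈ (1, 2) || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
```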
```julia
    m, s
end

```
No need to change this.
I restored the additional line.
src/transformations.jl
Outdated
```julia
end
dims ∈ [1,2] || throw(DomainError(dims, "fit only accept dims to be 1 or 2."))
tmin_tmax = extrema(X; dims=dims)
tmin, tmax = (vec(getindex.(tmin_tmax, i)) for i in 1:2)
```
Ah, it's too bad that this forces copying the result to extract the min and max. Do you think CUDA supports `eachslice`, or would it be efficient to iterate over rows/columns and call `extrema` on them? Then we could store the result directly in the right format.
I have been experimenting a little, but have not found a good solution. This prevents the copying, but does not work with CuArrays because of the element-wise `setindex!`:

```julia
l = size(X, dims == 1 ? 2 : 1)
tmin, tmax = (similar(X, l) for _ in 1:2)
for (i, x) in enumerate(eachslice(X; dims=dims))
    tmin[i], tmax[i] = extrema(x)  # per-element assignment does not work with CUDA
end
```

Then there is `zip` + `...`, but I think the performance of that is even worse because of the additional penalty of the splatting:
```julia
tmin, tmax = collect.(zip((extrema(x) for x in eachslice(X; dims=dims))...))
```

Also, the outputs no longer end up in a `CuArray`, which means `transform!` will fail on a `CuArray`.
An option is to not pursue support for `UnitRangeTransform` in this PR. Anyway, `extrema` has not yet been implemented in CUDA.jl. I can reintroduce `_compute_extrema`, which can then later be implemented for CuArrays.
Ah, too bad. Maybe better not to implement this until CUDA supports it. Could you ask for advice there about the best way to achieve this without copying?
I asked the question on Discourse, see here.
Might also be worth posting a link on #gpu on Slack.
In the latest commit I added a new `_compute_extrema` that eliminates the allocation. With that, it should now be as efficient as the original implementation for regular arrays. It also allows a temporary solution for CuArrays by implementing a specialized `_compute_extrema` for them:

```julia
function StatsBase._compute_extrema(X::CuArray; dims)
    tmin = vec(minimum(X; dims=dims))
    tmax = vec(maximum(X; dims=dims))
    return tmin, tmax
end
```

This is not optimal, but it works. Also, given the compute power of a GPU, I assume it will still give a significant performance improvement.
Cool. Though you'll need Compat 2.2 to use eachslice on Julia 1.0.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
…chslice (requires 1.1)
I suggest merging the PR now. After merging, users can define a specialized `_compute_extrema` for CuArrays, as shown above.

P.S. The CI states that "Some checks were not successful". I don't think this has anything to do with this PR; it failed on downloading the registry. Re-running the tests will probably eliminate the error.
nalimilan left a comment:
Fine with me. Just a few stylistic comments.
Thanks!

You are welcome. Thanks for your thorough reviews, @nalimilan and @wildart.
Addresses #619.
This code passes the tests and is ready for merge from my perspective.