
Add texture support from CuTextures.jl #206

Merged Jun 9, 2020

Conversation

@maleadt (Member) commented Jun 5, 2020

Fixes #46

@cdsousa: I'm still getting to grips with your design and the full extent of the relevant CUDA APIs, so I expect to do some work here before merging. Feel free to comment though! I've already changed some things to be closer to the CUDA C defaults.

@maleadt added the enhancement (New feature or request) and cuda kernels (Stuff about writing CUDA kernels) labels Jun 5, 2020
@maleadt maleadt marked this pull request as draft June 5, 2020 14:06
codecov bot commented Jun 5, 2020

Codecov Report

Merging #206 into master will decrease coverage by 0.33%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##           master     #206      +/-   ##
==========================================
- Coverage   82.30%   81.97%   -0.34%     
==========================================
  Files         143      145       +2     
  Lines        9315     9529     +214     
==========================================
+ Hits         7667     7811     +144     
- Misses       1648     1718      +70     
Impacted Files               Coverage Δ
src/CUDA.jl                  100.00% <ø> (ø)
test/texture.jl               90.74% <90.74%> (ø)
src/texture.jl                94.59% <94.59%> (ø)
examples/wmma/high-level.jl   11.11% <0.00%> (-38.89%) ⬇️
examples/wmma/low-level.jl    14.28% <0.00%> (-35.72%) ⬇️
lib/cuda/occupancy.jl         76.00% <0.00%> (-20.00%) ⬇️
test/device/wmma.jl            0.00% <0.00%> (-7.41%) ⬇️
deps/bindeps.jl               80.00% <0.00%> (-2.00%) ⬇️
test/execution.jl             38.68% <0.00%> (-1.47%) ⬇️
test/device/cuda.jl            9.72% <0.00%> (-0.73%) ⬇️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c6666d9...77be342

@cdsousa (Contributor) commented Jun 5, 2020

Well, the part I was least satisfied with was the _type_to_cuarrayformat_dict, reconstruct and cast "hacks". It deserves a review to check whether the design can be improved.

Also, the way that getindex works may be a little non-obvious: real-valued indices use normalized coordinates and integer indices use non-normalized coordinates.
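
For context, the idea boils down to something like this (a self-contained toy, not the actual CuTextures.jl code; ToyTexture just stands in for the device-side texture type):

```julia
# Toy illustration of dispatching getindex on the index type.
struct ToyTexture{T,N} end

# Real-valued indices → normalized (0..1) coordinates; integer indices → element
# coordinates. Integer <: Real, so the Integer method is the more specific match.
Base.getindex(t::ToyTexture, x::Real, y::Real)       = (:normalized,     Float32(x), Float32(y))
Base.getindex(t::ToyTexture, i::Integer, j::Integer) = (:non_normalized, Int32(i),   Int32(j))

ToyTexture{Float32,2}()[0.25, 0.75]   # → (:normalized, 0.25f0, 0.75f0)
ToyTexture{Float32,2}()[3, 4]         # → (:non_normalized, 3, 4)
```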

@maleadt (Member, Author) commented Jun 5, 2020

Also, the way that getindex works may be a little non-obvious: real-valued indices use normalized coordinates and integer indices use non-normalized coordinates.

Didn't you use the call overload (texture(i,j)) for normalized coordinates vs. square brackets for non-normalized ones? Both that and the Real/Int approach seem less than ideal to me; I'd rather use the hardware for that (i.e. configure the texture object as normalized or not, and use the same indexing function). The only difference then is whether to use 1-based indexing or not.
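
Something along these lines, i.e. a single indexing method where the normalization mode is a type parameter set when the texture is created (sketch only; tex_fetch is a placeholder, not an actual intrinsic):

```julia
# Toy sketch: one getindex for both modes; the hardware, configured at texture creation,
# interprets the coordinates. The only software-side difference is the 1-based offset.
struct ToyDeviceTexture{T,N,Normalized} end

tex_fetch(t, coords...) = coords   # placeholder for the real texture-fetch intrinsic

function Base.getindex(t::ToyDeviceTexture{T,N,Normalized},
                       coords::Vararg{Number,N}) where {T,N,Normalized}
    # normalized mode: pass 0..1 coordinates through untouched;
    # non-normalized mode: shift Julia's 1-based indices to 0-based hardware coordinates
    adjusted = Normalized ? coords : map(c -> c - one(c), coords)
    return tex_fetch(t, adjusted...)
end

ToyDeviceTexture{Float32,2,true}()[0.5f0, 0.5f0]   # → (0.5f0, 0.5f0)
ToyDeviceTexture{Float32,2,false}()[3, 4]          # → (2, 3)
```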

@cdsousa (Contributor) commented Jun 5, 2020

Didn't you use the call overload (texture(i,j))

Yeah, I used to have that design; it's perfectly fine.

I'd rather use the hardware for that (i.e. configure the texture object as normalized or not, and use the same indexing function). The only difference then is whether to use 1-based indexing or not.

That's possible too; we then just need to track that configuration on the host- and device-side objects, either by storing a flag or by splitting the types into two versions.

@maleadt (Member, Author) commented Jun 5, 2020

Yeah, I used to have that design; it's perfectly fine.

Did I start with the wrong code then? Your master branch still has that design: https://github.com/cdsousa/CuTextures.jl/blob/6838fff61ffe32a5bb8344ca1c6e6a3dca3fd16b/src/native.jl#L62-L70

That's possible too; we then just need to track that configuration on the host- and device-side objects, either by storing a flag or by splitting the types into two versions.

Yeah, that's what I do now: a field CPU-side and a typevar in the device type. But I need to play with the code some more to get a feel for that API :-)
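
Roughly this shape, with illustrative names (the actual field and type names in the PR may differ):

```julia
# Host-side wrapper: normalization is a plain runtime field, only used when creating
# the CUDA texture object.
struct HostTexture{T,N}
    handle::UInt64              # the CUtexObject handle
    dims::Dims{N}
    normalized_coordinates::Bool
end

# Device-side counterpart: the same setting lifted into a type parameter, so kernels
# specialize on it instead of branching on a runtime flag.
struct DeviceTexture{T,N,NC}
    dims::Dims{N}
    handle::UInt64
end

# Conversion at launch time, cf. the Adapt.adapt_storage method in this PR.
to_device(t::HostTexture{T,N}) where {T,N} =
    DeviceTexture{T,N,t.normalized_coordinates}(t.dims, t.handle)
```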

@cdsousa (Contributor) commented Jun 5, 2020

Did I start with the wrong code then? Your master branch still has that design:

Ahh, sorry for the confusion. I had only glanced over the getindex code you added and thought it was the design I used to have, with a single getindex for both normalized and non-normalized fetches that dispatched on the index type.
I didn't realize at first that you had already started working on the code!

@maleadt (Member, Author) commented Jun 8, 2020

About the automatic cast/reconstruct/_type_to_cuarrayformat_dict: I had actually been moving away from doing such things automatically (we used to do so for shuffle, ldg, etc.). The compiler does not support it, the hacks are fragile and hard to infer, and I don't trust that the layout/semantics are always going to be correct. It seems easier to me to admit that CUDA only supports texture memory of a limited set of types, and ask the user to reinterpret e.g. an RGBA{N0f8} array to NTuple{4,UInt8}. Thoughts?
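
I.e., something like this on the user's side (assuming ColorTypes and FixedPointNumbers for RGBA{N0f8}; the texture construction itself is elided since that API is exactly what this PR is still settling):

```julia
using CUDA, ColorTypes, FixedPointNumbers

# RGBA{N0f8} and NTuple{4,UInt8} are both isbits and 4 bytes, so the user can
# reinterpret explicitly instead of relying on an automatic cast/reconstruct.
img = CuArray(fill(RGBA{N0f8}(0.1, 0.2, 0.3, 1.0), 256, 256))
raw = reinterpret(NTuple{4,UInt8}, img)   # element type that CUDA textures support directly

# ... construct the texture from `raw`, and reinterpret back to RGBA{N0f8} when
# copying results off the GPU.
```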

@cdsousa (Contributor) commented Jun 8, 2020

I'm OK with that; we can always do that explicitly (however, I don't think we can reinterpret a CuDeviceArray inside a kernel, can we?).

On the other hand, we will still have an automatic mapping between NTuple{4, Float32} and float4, right? In that case, there must be a rationale for choosing NTuple{4, Float32} rather than something else (e.g., SIMD.Vec{4,Float32}).
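
(To be clear, the layout correspondence itself is easy to check on the host; I'm only wondering about the API choice:)

```julia
# NTuple{4,Float32} is a plain isbits type whose four lanes are stored contiguously,
# matching the 16-byte size of CUDA's float4 (alignment requirements aside).
@assert isbitstype(NTuple{4,Float32})
@assert sizeof(NTuple{4,Float32}) == 4 * sizeof(Float32) == 16
```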

@maleadt (Member, Author) commented Jun 9, 2020

I don't think we can reinterpret a CuDeviceArray inside a kernel, can we?

No; I could add that, but it seems simpler to reinterpret back when moving data off the GPU?

On the other hand, we will still have an automatic mapping between NTuple{4, Float32} and float4, right? In that case, there must be a rationale for choosing NTuple{4, Float32} rather than something else (e.g., SIMD.Vec{4,Float32}).

I'm not entirely sure what you mean; currently, using sources of type NTuple{4, T} will construct a 4-channel texture, and accessing that device-side just gives you that NTuple. (Come to think of it, maybe it would be convenient to allow passing the channel to getindex to get a plain value?)
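
In device code the current behavior would look roughly like this (sketch only; the per-channel access mentioned in the last comment is just the idea floated above, not something that exists):

```julia
using CUDA

# Sketch of kernel-side access to a 4-channel texture backed by NTuple{4,Float32} data.
function kernel(dst, tex)
    i = threadIdx().x
    texel = tex[i, 1]     # a whole texel, i.e. an NTuple{4,Float32}
    dst[i] = texel[1]     # pick out a single channel by hand
    # a per-channel getindex could return such a scalar directly instead
    return
end
```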

@cdsousa (Contributor) left a comment

On the other hand, we will still have an automatic mapping between NTuple{4, Float32} and float4, right? In that case, there must be a rationale for choosing NTuple{4, Float32} rather than something else (e.g., SIMD.Vec{4,Float32}).

I'm not entirely sure what you mean; currently, using sources of type NTuple{4, T} will construct a 4-channel texture, and accessing that device-side just gives you that NTuple. (Come to think of it, maybe it would be convenient to allow passing the channel to getindex to get a plain value?)

Not a big deal, I was just not sure why an NTuple{4, T} is the Julia-side version of a CUDA float4 (despite the fact that I actually used that equivalence in my original code). I guess that is the usual and right choice.

src/texture.jl Outdated
Base.size(tm::CuTexture) = size(tm.mem)

Adapt.adapt_storage(::Adaptor, t::CuTexture{T,N}) where {T,N} =
CuDeviceTexture{T,N,t.normalized_coordinates}(size(t.mem), t.handle)
@cdsousa (Contributor) commented:

Can this introduce any kind of type instability?
I didn't use a solution like this in the first place for fear of type instability, but now I guess that the kernel launch works as a function barrier, is that right?
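
To make the question concrete, here is a small CPU-only analogy of the pattern (no GPU code involved):

```julia
# Constructing a type whose parameter comes from a runtime Bool is type-unstable at the
# call site, but the callee still specializes on the concrete type it receives: the call
# acts as a function barrier, which is what the kernel launch would hopefully provide here.
struct Host; normalized::Bool; end
struct Device{NC} end

to_device(h::Host) = Device{h.normalized}()   # return type depends on a runtime value
inside(::Device{NC}) where {NC} = NC ? :normalized : :non_normalized  # inferred per NC

inside(to_device(Host(true)))   # one dynamic dispatch here, none inside `inside`
```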
