Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding cast string to float kernel #631

Merged
merged 17 commits into from
Oct 21, 2022

Conversation

hyperbolic2346
Copy link
Collaborator

@hyperbolic2346 hyperbolic2346 commented Oct 12, 2022

This adds a string to float kernel to improve performance of this conversion.

This is one of the casting kernels for NVIDIA/spark-rapids#5639.

Closes #632

@hyperbolic2346 hyperbolic2346 added the enhancement New feature or request label Oct 12, 2022
@hyperbolic2346 hyperbolic2346 self-assigned this Oct 12, 2022
Signed-off-by: Mike Wilson <knobby@burntsheep.com>
@hyperbolic2346
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have someone with more CUDA experience review cast_string_to_float.cu.

src/main/cpp/src/cast_string.hpp Outdated Show resolved Hide resolved
src/main/cpp/src/cast_string.hpp Outdated Show resolved Hide resolved
src/main/cpp/src/cast_string.hpp Outdated Show resolved Hide resolved
src/main/java/com/nvidia/spark/rapids/jni/CastStrings.java Outdated Show resolved Hide resolved
hyperbolic2346 and others added 2 commits October 12, 2022 21:12
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@ttnghia ttnghia self-requested a review October 13, 2022 02:17
@@ -144,6 +144,7 @@ add_library(
src/CastStringJni.cpp
src/NativeParquetJni.cpp
src/cast_string.cu
src/cast_string_to_float.cu
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is cast_string.cu? This new file name is more meaningful, thus we may want to rename the old file to make it more meaningful too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally the plan was to put all the casting kernels in cast_string.cu, but at this point we realized that it would simply be too big. This one is split out now and the original kernels will split out as well in another PR.

Comment on lines 60 to 69
__device__ string_to_float(T* _out,
bitmask_type* _validity,
int32_t* _ansi_except,
size_type* _valid_count,
const char* const _chars,
offset_type const* _offsets,
int _warp_id,
uint64_t const* const _ipow,
bitmask_type const* _incoming_null_mask,
size_type const _num_rows)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel something's wrong: This functor takes too many parameters. It would be great if we can decouple them by using multiple functors instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design choice was to use a functor to prevent passing so many variables around to each function, but that leaves us passing it all in at the beginning to feed into that object. I don't know how much we could free up as this is a single kernel invocation for speed reasons.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, we can pull the base functor into this function and then call the private functions with parameters, but it feels like too much for this function. If we push that out to a function we're back in the same boat as now as we have to pass all that into that next function.


TYPED_TEST(StringToFloatTests, ANSIInvalids)
{
cudf::test::strings_column_wrapper in[] = {{"A"}, {"."}, {"e"}};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should add more cases for the ansi test. Here are only very simple cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added tests from the plugin that failed to ensure they didn't regress. The plugin tests are VERY extensive.

Copy link
Collaborator

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally I have read through the code. My head was about to explode without calling to any xxx::explode.

@hyperbolic2346
Copy link
Collaborator Author

Performance information added to spark-rapids PR since the original code path was entirely inside that code base.

@hyperbolic2346
Copy link
Collaborator Author

build

}
CUDF_EXPECTS(dtype == data_type{type_id::FLOAT32} || dtype == data_type{type_id::FLOAT64},
"invalid float data type");
if (string_col.size() == 0) { return std::make_unique<column>(); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably empty column of floating point type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I should. Thanks.

@ttnghia
Copy link
Collaborator

ttnghia commented Oct 21, 2022

Please be aware of rapidsai/cudf#11875. You may need to change the code in this PR (and throughout this repo, in another separate PR).

Given that PR is ready to merge, I'll probably approve this after that is propagated here.

ttnghia
ttnghia previously approved these changes Oct 21, 2022
Copy link
Collaborator

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed my mind. Please finalize this and merge it ASAP then make a separate PR to adopt that breaking change.

@hyperbolic2346
Copy link
Collaborator Author

build

@hyperbolic2346 hyperbolic2346 merged commit 6e737a9 into NVIDIA:branch-22.12 Oct 21, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/float_cast branch October 21, 2022 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support cast string to float
5 participants