Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Ascii function for ascii and latin-1 [databricks] #10054

Merged
merged 10 commits into from
Dec 30, 2023

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Dec 14, 2023

Closes #9585

This PR part supported ascii function for strings starting with ASCII and Latin-1 supplement, returning results from 0 to 255. The function is disabled by default.

This PR uses cudf::code_points to get the utf-8 code points of the first letter in a string, and converts to ASCII using the following rules:

ascii utf-8 dec workaround
0~127 0~127 cudf::code_points
128~191 49792~49855 code_points - 49,664
192~255 50048~50111 code_points -49,856

It is good enough for customer, we can file another issue to fully support it when needed.

perf test results

data: 50000000 string from big datagen:

spark.time(df.selectExpr("COUNT(ascii(a)) as asc0", "COUNT(ascii(a)) as asc1", "COUNT(ascii(a)) as asc2", "COUNT(ascii(a)) as asc3", "COUNT(ascii(a)) as asc4", "COUNT(ascii(a)) as asc5", "COUNT(ascii(a)) as asc6", "COUNT(ascii(a)) as asc7", "COUNT(ascii(a)) as asc8", "COUNT(ascii(a)) as asc9").show())
GPU Time (ms) CPU Time (ms) Speed up
1,653 2,787 69%

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this Dec 14, 2023
@sameerz sameerz added the feature request New feature or request label Dec 18, 2023
@thirtiseven
Copy link
Collaborator Author

Did some perf tests, and gpu is slightly faster than cpu (12%). I plan to refine my implementation to reduce times of cuDF API calling, but the undefined values will be more undefined, which doesn't matter for now.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

Ok, I refined the implementation and it now runs faster, about 69% speedup over cpu.

@thirtiseven
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Dec 26, 2023
@thirtiseven
Copy link
Collaborator Author

Test failed in CI because Latin-1 support in Spark is from 3.3.1, will add shim for it so ascii under 3.3.1 will be faster and (maybe) fully supported.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Dec 27, 2023
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to be sure that we test this on databricks

@revans2 revans2 changed the title Support Ascii function for ascii and latin-1 Support Ascii function for ascii and latin-1 [databricks] Dec 27, 2023
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Dec 27, 2023
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

@revans2
Copy link
Collaborator

revans2 commented Dec 28, 2023

Looks like we need to fix this for 3.5.1 too

@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

perf test results on the byte-based solution (in lower version shims):

GPU Time (ms) CPU Time (ms) Speed up
541.4 2,020.6 3.73x

@revans2
Copy link
Collaborator

revans2 commented Dec 29, 2023

The previous tests timed out trying to run on databricks. Trying again.

@revans2
Copy link
Collaborator

revans2 commented Dec 29, 2023

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] support ascii function
3 participants