Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow ORC tests to run with wider range of timestamp input #6545

Merged
merged 7 commits into from
Sep 20, 2022

Conversation

ttnghia
Copy link
Collaborator

@ttnghia ttnghia commented Sep 12, 2022

This reverses the changes for orc_test.py#test_read_with_more_columns that have been made in #6286 due to a bug in cudf's orc reader. The bug is fixed by rapidsai/cudf#11586.

Closes #6312.

Depends on:

CI will not pass until they are all merged and spark-rapids-jni is updated with it.

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
@ttnghia ttnghia added bug Something isn't working test Only impacts tests P1 Nice to have for release cudf_dependency An issue or PR with this label depends on a new feature in cudf labels Sep 12, 2022
@ttnghia ttnghia self-assigned this Sep 12, 2022
jlowe
jlowe previously approved these changes Sep 12, 2022
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here look OK, although there are more tests to cleanup after the ORC timestamp fix. See get_orc_timestamp_gen in orc_test.py and _restricted_timestamp in hive_write_test.py. Can be done in a followup PR, but it seems very relevant to the theme of this PR.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 12, 2022

Sure. I'll update all these related tests too.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

I was working on the timestamp test in ORC as indicated by @jlowe and discovered a new bug with wrong read/write day number (rapidsai/cudf#11691). Thus, this PR should be merged as-is without the suggested changes. They will be addressed later when the issue rapidsai/cudf#11691 is fixed.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

build

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

CI failed with unrelated tests:

[2022-09-13T05:34:02.621Z] - transpiler detects invalid cuDF patterns *** FAILED ***
[2022-09-13T05:34:02.621Z]   cuDF unexpectedly compiled expression: 	+|a (RegularExpressionTranspilerSuite.scala:62)
[2022-09-13T05:34:02.622Z] - Detect unsupported combinations of line anchors and \W and \D
[2022-09-13T05:34:02.622Z] - cuDF does not support choice with repetition
[2022-09-13T05:34:02.622Z] - cuDF does not support terms ending with line anchors on one side of a choice
[2022-09-13T05:34:02.622Z] - cuDF unsupported choice cases *** FAILED ***
[2022-09-13T05:34:02.622Z]   Expected exception ai.rapids.cudf.CudfException to be thrown, but no exception was thrown (RegularExpressionTranspilerSuite.scala:107)
[2022-09-13T05:34:02.622Z] - sanity check: choice edge case 2 *** FAILED ***
[2022-09-13T05:34:02.622Z]   Expected exception ai.rapids.cudf.CudfException to be thrown, but no exception was thrown (RegularExpressionTranspilerSuite.scala:115)

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

build

1 similar comment
@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

build

@jlowe
Copy link
Member

jlowe commented Sep 13, 2022

They will be addressed later when the issue rapidsai/cudf#11691 is fixed.

It appears that new issue is actually an older issue related to the Gregorian calendar switch. That was avoided previously by not generating timestamps before the Gregorian calendar conversion, for example see https://github.com/NVIDIA/spark-rapids/blob/branch-21.08/integration_tests/src/main/python/orc_test.py#L51 from 22.08. I'm fine with not fixing it here, but if we leave it for a followup, we need an issue to track.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

It appears that new issue is actually an older issue related to the Gregorian calendar switch. That was avoided previously by not generating timestamps before the Gregorian calendar conversion, for example see https://github.com/NVIDIA/spark-rapids/blob/branch-21.08/integration_tests/src/main/python/orc_test.py#L51 from 22.08. I'm fine with not fixing it here, but if we leave it for a followup, we need an issue to track.

Did we have any issue tracking that in either cudf or here?

@jlowe
Copy link
Member

jlowe commented Sep 13, 2022

Did we have any issue tracking that in either cudf or here?

Yes, #131 for the ORC reader and #139 for the ORC writer. This limitation is also documented in the compatibility document.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

I set orc_gen to use Timestamp after 1590 but still see failures thus this should be a new bug:

cpu = datetime.datetime(1647, 5, 20, 19, 25, 3, 638)
gpu = datetime.datetime(1647, 5, 20, 19, 25, 2, 638)

This time they are failures with second values. It seems rapidsai/cudf#11586 is not completely fixed.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 13, 2022

Convert this into draft since there are still failures that may need cudf fix.

@ttnghia ttnghia marked this pull request as draft September 13, 2022 16:21
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
@ttnghia ttnghia changed the title Reverse orc_test.py#test_read_with_more_columns Allow ORC tests to run with timestamp before 1970 year Sep 14, 2022
@ttnghia ttnghia changed the title Allow ORC tests to run with timestamp before 1970 year Allow ORC tests to run with wider range of timestamp input Sep 16, 2022
@jlowe
Copy link
Member

jlowe commented Sep 19, 2022

build

1 similar comment
@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 19, 2022

build

@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 19, 2022

build

1 similar comment
@ttnghia
Copy link
Collaborator Author

ttnghia commented Sep 19, 2022

build

@ttnghia ttnghia marked this pull request as ready for review September 20, 2022 03:04
@ttnghia ttnghia requested a review from jlowe September 20, 2022 03:04
@ttnghia ttnghia merged commit 7937fb8 into NVIDIA:branch-22.10 Sep 20, 2022
@ttnghia ttnghia deleted the reverse_orc_test branch September 22, 2022 23:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf P1 Nice to have for release test Only impacts tests
Projects
Release 22.10
Awaiting triage
Development

Successfully merging this pull request may close these issues.

[BUG] Timestamp from GPU ORC reading is different from CPU ORC reading
2 participants