Skip to content

Conversation

Aggarwal-Raghav
Copy link
Contributor

What changes were proposed in this pull request?

Check HIVE-29012 for the stacktrace.

Why are the changes needed?

When protologging hook is enabled, it stores the query plan for a query as well. The conf object in ExplainTask is null for the "Hive Hook Proto Log Writer" thread.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Will see CI output + local single node setup/cluster testing

@Aggarwal-Raghav
Copy link
Contributor Author

Based on the debugging, the Task#initialize() method creates conf object in HiveServer2-Handler-Pool thread and when the post hook for protologging is triggered, it created a new thread (Hive Hook Proto Log Writer) which get the query plan and at that time, ExplainTask#outputPlan() has conf as null which causes NPE.

final int limit = conf.getIntVar(ConfVars.HIVE_EXPLAIN_NODE_VISIT_LIMIT);

@Aggarwal-Raghav
Copy link
Contributor Author

@zabetak, can you please help with the review as the changes are around HIVE-22173's code.

@Aggarwal-Raghav Aggarwal-Raghav changed the title [WIP] HIVE-29012: NPE in ExplainTash when protologging posthook is enabled HIVE-29012: NPE in ExplainTash when protologging posthook is enabled Jun 16, 2025
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @Aggarwal-Raghav ! I left some questions/suggestions for the proposed fix.

For completeness, it would be nice to also add a unit test possibly somewhere in TestHiveProtoLoggingHook or TestHiveHookEventProtoPartialBuilder. Some fixes are trivial but writting also tests about them increases test coverage and guards against future regressions.

@Aggarwal-Raghav Aggarwal-Raghav changed the title HIVE-29012: NPE in ExplainTash when protologging posthook is enabled HIVE-29012: NPE in ExplainTask when protologging posthook is enabled Jun 18, 2025
@Aggarwal-Raghav
Copy link
Contributor Author

Just info: HIVE-27240, was very similar to this issue based on stacktrace, which shows the need for UT :-)

@Aggarwal-Raghav
Copy link
Contributor Author

Hi @zabetak, apologies for late reply, just wanted to let you know that I'm working on this PR and found some new issues like HIVE-29053 along the way when coming up with UT.

Writing UT for this PR is not straightforward and need your opinion on this. There are 2 things to note:

  1. In case of NPE, its just log the error and continue as intended, which means using TestHiveHookEventProtoPartialBuilder or TestHiveProtoLoggingHook won't cut it IMO (tried with mockito as well, didn't work).
  2. Writes/flush of proto data are not consistent based on ScheduledThreadPoolExecutor https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java#L277,
    one way to force flush is to stop the HS2.

What I/m thinking is, writing a itest/hive-unit integration test as below code. Because of [2], I'm not able to make qfile work as the proto data is written post test completition.

CREATE TABLE tbl1 (id INT, name STRING);
INSERT INTO tbl1 VALUES (1, 'Alice'), (2, 'Bob');
SELECT * FROM tbl1;

msck repair table sys.proto_hive_query_data;

select count(*) from sys.proto_hive_query_data where lower(trim(regexp_replace(get_json_object(otherinfo[1].value, '$.queryText'), '\n|\\\s+', ' '))) = "select * from tbl1";

Because of NPE the QUERY Object will not be part of proto data so without fix the output should be 0, with fix it should be 1.

@zabetak
Copy link
Member

zabetak commented Jul 4, 2025

The end to end approach (2.) seems a bit of an overkill. Even if the NPE is caught and logged you still have some missing entries (e.g., queryPlan) in the otherInfo map. You could probably base your test on constructing a simple ExplainWork object and reason about missing entries in the info map. Would that work as a unit test?

@Aggarwal-Raghav
Copy link
Contributor Author

Aggarwal-Raghav commented Jul 5, 2025

The end to end approach (2.) seems a bit of an overkill. Even if the NPE is caught and logged you still have some missing entries (e.g., queryPlan) in the otherInfo map. You could probably base your test on constructing a simple ExplainWork object and reason about missing entries in the info map. Would that work as a unit test?

Thanks for the suggestions/help @zabetak. I have come up with a test case highlighting the problem observed in local cluster. Can you please review and confirm if its up to the standards?

@Aggarwal-Raghav
Copy link
Contributor Author

Test failure org.apache.hadoop.hive.metastore.security.TestZookeeperTokenStoreSSLEnabled, Seems irrelevant to change.

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your effort @Aggarwal-Raghav !! The prod changes LGTM.

I just have some small (mostly nit) comments regarding the test. There are two best-practices that I wanted to highlight here:

  • Avoid mocking when its easy/simple to create real objects
  • Avoid using java assert for tests

In some cases, "best-practices" can be subjective but I feel the above are applicable/beneficial in every project.

Copy link

@Aggarwal-Raghav
Copy link
Contributor Author

@zabetak , I have addressed the review comments. can you please take a look?

@Aggarwal-Raghav Aggarwal-Raghav requested a review from zabetak July 11, 2025 02:52
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, will merge it soon.

@zabetak zabetak merged commit 2b4a07e into apache:master Jul 13, 2025
4 checks passed
@zabetak
Copy link
Member

zabetak commented Jul 13, 2025

Thanks for the PR @Aggarwal-Raghav !

@Aggarwal-Raghav Aggarwal-Raghav deleted the HIVE-29012 branch July 17, 2025 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants