
Add TaskFailureListener at JVM side #41

Merged: 14 commits merged into NVIDIA:branch-22.06 on Jul 25, 2022

Conversation

@wjxiz1992 (Collaborator) commented Jun 30, 2022

Signed-off-by: Allen Xu allxu@nvidia.com

Closes #39, #33, #37

To enable the listener, append this config to the submit command:

--conf spark.extraListeners=com.nvidia.spark.rapids.listener.TaskFailureListener

Don't forget to append NDSBenchmarkListener-1.0-SNAPSHOT.jar to the --jars config.
When this config is not set, the listener will not be registered with the SparkContext.
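Putting the two options together, a full submit command might look like the following sketch (the jar path and the `nds_power.py` script name are illustrative placeholders, not taken from this PR):

```shell
# Register the JVM-side TaskFailureListener and ship the jar that contains it.
spark-submit \
  --jars /path/to/NDSBenchmarkListener-1.0-SNAPSHOT.jar \
  --conf spark.extraListeners=com.nvidia.spark.rapids.listener.TaskFailureListener \
  nds_power.py
```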

Update: performance tests with and without the new listener, using the same Spark submit configs on Dataproc:

| index | power test time with listener (ms) | power test time without listener (ms) |
|-------|-----------------------------------:|--------------------------------------:|
| 1     | 2734682                            | 2640671                               |
| 2     | 2617317                            | 2757564                               |
| 3     | 2715048                            | 2779721                               |
| 4     | 2772292                            | 2664533                               |
| avg   | 2709834.75                         | 2710622.25                            |

@wjxiz1992 wjxiz1992 self-assigned this Jun 30, 2022
@wjxiz1992 (Collaborator, Author) commented:
Currently tested on a Dataproc environment with a 3 TB data run; no RPC errors when shutting down the SparkContext.

22/06/30 10:13:12 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@6d7430e3{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
====== Power Test Time: 2859843 milliseconds ======
====== Total Time: 2929142 milliseconds ======

@wjxiz1992 wjxiz1992 marked this pull request as ready for review July 4, 2022 10:04
@wjxiz1992 (Collaborator, Author) commented:
@GaryShen2008 After discussing with @pxLi, I changed the version from 22.06-SNAPSHOT to 1.0-SNAPSHOT so that we can build it manually and deploy it to URM, since this Scala code is not expected to change much in the future. What do you think?

@wjxiz1992 wjxiz1992 changed the title Add PythonTaskFailureListener at JVM side Add TaskFailureListener at JVM side Jul 5, 2022
    spark_env = dict(self.spark_session.sparkContext.getConf().getAll())
    if 'spark.extraListeners' in spark_env and \
            'com.nvidia.spark.rapids.listener.TaskFailureListener' in spark_env['spark.extraListeners']:
        listener = python_listener.PythonListener()
        listener.register()
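To make the registration guard above runnable in isolation, here is a minimal sketch that uses a plain dict in place of the Spark conf and a stub class standing in for `python_listener.PythonListener` (the stub and the `maybe_register_listener` helper name are illustrative, not from the PR):

```python
LISTENER_CLASS = 'com.nvidia.spark.rapids.listener.TaskFailureListener'


class PythonListener:
    """Stub standing in for python_listener.PythonListener."""
    def __init__(self):
        self.failures = []

    def register(self):
        return self


def maybe_register_listener(spark_env):
    """Register the failure listener only when it was passed via
    spark.extraListeners; otherwise warn and return None."""
    if LISTENER_CLASS in spark_env.get('spark.extraListeners', ''):
        listener = PythonListener()
        listener.register()
        return listener
    print('TaskFailureListener not found in spark.extraListeners; '
          'task failures will not be tracked.')
    return None
```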
Collaborator:
Nit: can we print something to tell the user that the listener is not enabled when there's no spark.extraListeners?

Collaborator (Author):
Added

    start_time = int(time.time() * 1000)
    fn(*args)
    end_time = int(time.time() * 1000)
    if listener and len(listener.failures) != 0:
        self.summary['queryStatus'].append("CompletedWithTaskFailures")
    else:
        # NOTE: when the listener is not used, the queryStatus field will
        # always be "Completed" in the JSON summary
        self.summary['queryStatus'].append("Completed")
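To make the reviewers' concern concrete, the status logic above can be sketched as a self-contained function: when `listener` is None (i.e. spark.extraListeners was not set), task failures are invisible and "Completed" is reported regardless. The `run_and_record` name and the `FakeListener` stand-in are illustrative, not from the PR.

```python
import time


def run_and_record(fn, args, listener, summary):
    """Run a query function, record its wall time, and append a status.

    listener may be None when the TaskFailureListener was not registered;
    in that case "Completed" does not guarantee there were no failed tasks.
    """
    start_time = int(time.time() * 1000)
    fn(*args)
    end_time = int(time.time() * 1000)
    if listener and len(listener.failures) != 0:
        summary['queryStatus'].append("CompletedWithTaskFailures")
    else:
        summary['queryStatus'].append("Completed")
    return end_time - start_time
```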
Collaborator:
If there's no listener, I'm not sure we can say it completed; the logic has changed here.
Previously, queryStatus was always reported with the listener active, so the "Completed" status always meant the query succeeded without any failed task.
So, at the least, print a message telling the user that "Completed" doesn't mean there were no failed tasks when the listener is not registered.

@wjxiz1992 (Collaborator, Author) commented Jul 14, 2022:
I could even add a new unknown value for queryStatus when the listener is not in use. Is it worth adding, or should we just leave a message saying this field may not be accurate?

Collaborator:
I think a message should be enough; no need for a new status.

Collaborator (Author):
Added.

@wjxiz1992 wjxiz1992 merged commit 29c4627 into NVIDIA:branch-22.06 Jul 25, 2022
Successfully merging this pull request may close these issues.

Implement TaskFailure listener at JVM side