This repository has been archived by the owner on Feb 3, 2021. It is now read-only.

Spark log retrieval broken in aztk 0.10.0 #679

Closed
mmduyzend opened this issue Nov 1, 2018 · 3 comments

Comments

@mmduyzend (Contributor) commented Nov 1, 2018

In previous releases of aztk, the Spark logs for a job submitted in cluster mode are accessible in a couple of ways:

  • Written to stderr while the job is running (assuming you haven't passed --no-wait)
  • Retrievable via aztk spark cluster app-logs

In aztk 0.10.0, when scheduling_target is set to "master", the Spark logs are not written to stderr while the job is running, and aztk spark cluster app-logs just hangs, so the logs are inaccessible.

When scheduling_target is set to "any" (the default), aztk spark cluster app-logs seems to work correctly (so logs are accessible), but the submit command (without --no-wait) raises an AttributeError when it tries to write the running logs to stderr:

  File "/home/user/.virtualenvs/aztk-lHBOMM6y/bin/aztk", line 11, in <module>
    sys.exit(main())
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk_cli/entrypoint.py", line 44, in main
    run_software(args)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk_cli/entrypoint.py", line 72, in run_software
    func(args)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk_cli/spark/endpoints/spark.py", line 25, in execute
    func(args)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk_cli/spark/endpoints/cluster/cluster.py", line 76, in execute
    func(args)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk_cli/spark/endpoints/cluster/cluster_submit.py", line 148, in execute
    exit_code = utils.stream_logs(client=spark_client, cluster_id=args.cluster_id, application_name=args.name)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk_cli/utils.py", line 128, in stream_logs
    id=cluster_id, application_name=application_name, tail=True, current_bytes=current_bytes)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/spark/client/cluster/operations.py", line 258, in get_application_log
    current_bytes)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/spark/client/cluster/helpers/get_application_log.py", line 9, in get_application_log
    base_application_log = core_base_operations.get_application_log(cluster_id, application_name, tail, current_bytes)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/base_operations.py", line 227, in get_application_log
    return get_application_log.get_application_log(self, id, application_name, tail, current_bytes)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/helpers/get_application_log.py", line 133, in get_application_log
    return get_log(base_operations, cluster_id, application_name, tail, current_bytes)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/helpers/get_application_log.py", line 95, in get_log
    task = __wait_for_app_to_be_running(base_operations, cluster_id, application_name)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/helpers/get_application_log.py", line 28, in __wait_for_app_to_be_running
    task_state = base_operations.get_task_state(cluster_id, application_name)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/base_operations.py", line 323, in get_task_state
    return get_task_state.get_task_state(self, id, task_name)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/helpers/get_task_state.py", line 15, in get_task_state
    task = core_cluster_operations.get_batch_task(cluster_id, task_id)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/base_operations.py", line 346, in get_batch_task
    return task_table.get_batch_task(self.batch_client, id, task_id)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/utils/retry.py", line 17, in wrapper
    return function(*args, **kwargs)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/utils/try_func.py", line 8, in wrapper
    return function(*args, **kwargs)
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/helpers/task_table.py", line 134, in get_batch_task
    return __convert_batch_task_to_aztk_task(batch_client.task.get(id, task_id))
  File "/home/user/.virtualenvs/aztk-lHBOMM6y/lib/python3.6/site-packages/aztk/client/base/helpers/task_table.py", line 43, in __convert_batch_task_to_aztk_task
    task.node_id = batch_task.node_info.node_id
AttributeError: 'NoneType' object has no attribute 'node_id'
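The final frame pinpoints the failure: __convert_batch_task_to_aztk_task in task_table.py dereferences batch_task.node_info, which is None for a task the Batch service has accepted but not yet scheduled onto a node. A minimal reproduction, using a hypothetical stand-in class rather than the real Azure Batch model:

```python
class BatchTask:
    """Hypothetical stand-in for the Azure Batch task model."""

    def __init__(self, node_info=None):
        # node_info stays None until Batch schedules the task onto a node
        self.node_info = node_info


task = BatchTask()  # submitted, but not yet scheduled
try:
    node_id = task.node_info.node_id  # dereferences None
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'node_id'
```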

Spark logs are useful for debugging as well as for monitoring job progress, so it is important that they be accessible for all jobs, even those where scheduling_target is "master". While not absolutely essential, having the logs streamed to stderr while the job is running does make monitoring and debugging more convenient.

(I haven't tried job-mode jobs, but I suspect the behavior is the same.)

@jafreck (Member) commented Nov 1, 2018

In v0.10.0, when scheduling_target is set to "master", cluster app logs are still available; they are just not available during application execution. I have verified that logs are accessible by running aztk spark cluster app-logs: the process looks like it hangs, but it is really just waiting for the application to complete and for the logs to be uploaded. After the application completes or fails, it will fetch and display the logs. For long-running jobs, this is clearly not ideal.
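The waiting behavior described above can be sketched as a poll-then-fetch loop (hypothetical helper names, not aztk's actual implementation):

```python
import time


def wait_then_fetch_logs(get_state, fetch_logs, poll_seconds=5):
    # Block until the application reaches a terminal state; the CLI
    # appears to hang here, but it is only polling.
    while get_state() not in ("completed", "failed"):
        time.sleep(poll_seconds)
    # Once the application has finished, the uploaded log can be fetched.
    return fetch_logs()
```

For scheduling_target "master", aztk spark cluster app-logs effectively behaves like this: nothing is displayed until the application reaches a terminal state.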

Log streaming for scheduling_target was not included in the 0.10.0 release, but I understand that it is a convenience for monitoring and debugging long-running jobs. We can prioritize this feature to bring scheduling_target.Master to parity with scheduling_target.Any.

The AttributeError issue is definitely a bug, and will be fixed promptly. It should only arise when a Batch Task has been submitted but not yet scheduled. Thanks for pointing this out -- I'll put out a fix for it and make sure our test cases cover that scenario.
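A defensive guard along these lines would avoid the crash for not-yet-scheduled tasks (a sketch using stand-in objects, not the actual aztk patch):

```python
from types import SimpleNamespace


def convert_batch_task_to_aztk_task(batch_task):
    """Sketch of a None-safe conversion (hypothetical, not aztk's code)."""
    task = SimpleNamespace()
    # node_info is None while the task is submitted but not yet scheduled,
    # so guard before dereferencing it.
    task.node_id = batch_task.node_info.node_id if batch_task.node_info else None
    return task


pending = SimpleNamespace(node_info=None)  # accepted, not yet scheduled
running = SimpleNamespace(node_info=SimpleNamespace(node_id="tvm-1"))

print(convert_batch_task_to_aztk_task(pending).node_id)  # None
print(convert_batch_task_to_aztk_task(running).node_id)  # tvm-1
```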

@jafreck (Member) commented Nov 1, 2018

@mmduyzend I logged a separate issue for the log streaming feature: #680. I will resolve this issue once the AttributeError bug has been resolved.

@mmduyzend (Contributor, Author) commented

Thanks @jafreck. Looking forward to seeing fixes for these issues soon in the next release.
