Feature: Spark scheduling target #661
Conversation
@@ -68,7 +68,9 @@ def get_log(batch_client, blob_client, cluster_id: str, application_name: str, t

    task = __wait_for_app_to_be_running(batch_client, cluster_id, application_name)

    if not __check_task_node_exist(batch_client, cluster_id, task):
        #TODO: find a better way to detect ghost tasks -- metadata
This needs to be done before merging. Only detecting TaskState.completed
is too flimsy, especially around many concurrent application submits.
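The check under discussion could be factored as a pure predicate, which makes it easy to unit-test in isolation. The sketch below is only illustrative — the helper name, the `"completed"` state string, and the `node_info` shape mirror the azure-batch models but are assumptions, not the PR's actual implementation:

```python
# Hypothetical sketch: treat a task as a "ghost" when Batch reports it
# completed but it was never assigned to a compute node. A state check
# alone ("completed") is flimsy under many concurrent submits; pairing
# it with the node assignment narrows the false positives.

def is_ghost_task(state: str, node_info) -> bool:
    """Return True when the task claims completion without ever
    having run on a node (no node assignment recorded)."""
    return state == "completed" and node_info is None
```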
aztk/node_scripts/submit.py
Outdated
def _download_resource_file(task_id, resource_file):
    # timeout = 30 # set to default blob download timeout
uncomment or remove?
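If the timeout is kept, one option is lifting it into a named default that callers can override instead of a commented-out local. This is only a sketch: `urllib` stands in here to keep it self-contained, whereas the PR itself streams via requests' `response.iter_content`, and the function names are illustrative:

```python
import urllib.request

# 30 s mirrors the commented-out default in the diff; naming it makes
# the fallback visible and overridable rather than dead code.
DEFAULT_DOWNLOAD_TIMEOUT = 30  # seconds

def resolve_timeout(timeout=None):
    """Return the caller's timeout, falling back to the default."""
    return DEFAULT_DOWNLOAD_TIMEOUT if timeout is None else timeout

def download_resource(url, timeout=None):
    # urllib shown only for a self-contained sketch; the real code
    # streams the response in chunks instead of buffering it
    return urllib.request.urlopen(url, timeout=resolve_timeout(timeout))
```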
aztk/node_scripts/submit.py
Outdated
if resource_file.file_path:
    write_path = os.path.join(os.environ.get("AZ_BATCH_TASK_WORKING_DIR"), resource_file.file_path)
    with open(write_path, 'wb') as stream:
        for chunk in response.iter_content(chunk_size=16777216):
16777216 is a magic number — add a comment explaining why this value?
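One way to address this: 16777216 is exactly 16 MiB, and a named constant documents that intent where the bare literal does not. A sketch, not the PR's code — the exact chunk size is a tuning choice, not something mandated by the blob service:

```python
# 16 * 1024 * 1024 == 16777216: stream blob downloads in 16 MiB chunks
# so large resource files are never buffered fully in memory.
DOWNLOAD_CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB

# usage sketch (response assumed to be a requests.Response):
# for chunk in response.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
#     stream.write(chunk)
```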
aztk/node_scripts/submit.py
Outdated
serialized_task_sas_url = sys.argv[1]

try:
    return_code = ssh_submit(serialized_task_sas_url)
what happens if the ssh connection is killed?
the ssh session only exists to kick off this process; it does not live until completion, so this return_code is not sent back to the client.
It could be written to storage for retrieval, though. Spark application status, execution time, and return value are still open issues here.
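The "write it to storage for retrieval" idea could look something like the sketch below: serialize the result as JSON that a later step (or the client) uploads and fetches. Every name and field here is hypothetical, not part of the PR:

```python
import json
import time

def serialize_result(return_code: int, application_name: str) -> str:
    """Hypothetical sketch: package the application's exit status as
    JSON for upload to storage, since the ssh session that launched
    the process cannot report it back directly."""
    return json.dumps({
        "application": application_name,   # illustrative field names
        "return_code": return_code,
        "completed_at": time.time(),
    })
```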
ok sounds good then
.vscode/settings.json
Outdated
@@ -13,7 +13,7 @@
     ],
     "python.formatting.provider": "yapf",
     "python.venvPath": "${workspaceFolder}/.venv/",
-    "python.pythonPath": "${workspaceFolder}/.venv/Scripts/python.exe",
+    "python.pythonPath": ".venv\\Scripts\\python.exe",
Can you make this a Windows-specific setting? I think you can do:
"windows": {
    "python.pythonPath": ".venv\\Scripts\\python.exe"
}
vscode doesn't like the "windows" key, so I'm not sure this is possible. Do you have any docs on this? OS-specific settings would be really nice to have.
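For reference: VS Code documents platform-specific blocks ("windows", "osx", "linux") for launch.json and tasks.json, but not for settings.json, which likely explains the error above. A hypothetical launch.json sketch — the "pythonPath" property follows the older Python-extension debug schema and the paths are illustrative:

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: current file",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "windows": { "pythonPath": ".venv\\Scripts\\python.exe" },
            "linux": { "pythonPath": ".venv/bin/python" }
        }
    ]
}
```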
def get_task_status(core_cluster_operations, cluster_id: str, task_id: str):
    try:
        # TODO: return TaskState object instead of str
todo?
try:
    tasks.append(yaml.load(stream))
except yaml.YAMLError as exc:
    print(exc)
print => log
This is captured in the task output and uploaded to storage. For user visibility into task errors, I think we should upload the errors as well.
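A sketch of the print => log swap, using `logger.exception` so the traceback and severity land in the captured task output. Two sketch-only liberties, flagged here and in the comments: the real code catches `yaml.YAMLError` specifically (this sketch catches broadly so it loads without PyYAML installed), and `safe_load` replaces the bare `yaml.load`:

```python
import logging

logger = logging.getLogger("aztk.node_scripts")  # logger name is illustrative

def parse_task(stream, tasks):
    """Append a parsed task definition, logging (not printing) failures
    so errors carry a level and traceback in the captured output."""
    try:
        import yaml  # PyYAML; imported lazily only to keep the sketch importable
        tasks.append(yaml.safe_load(stream))
    except Exception:
        # real code should catch yaml.YAMLError; broad except is a sketch liberty
        logger.exception("failed to parse task definition")
```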
Reintroduce scheduling_target with a direct-to-node implementation that does not rely on Batch scheduling.

Reqs:
- exit_code

This implementation should be aligned as much as possible with the standard Batch Task scheduling feature to maximize code reuse.

Fix #670
Fix #527