
Feature: spark debug tool #455

Merged: 29 commits into Azure:master on Apr 9, 2018

Conversation

@jafreck (Member) commented Mar 22, 2018:

Fix #347

aztk/client.py Outdated
pool, nodes = self.__get_pool_details(cluster_id)
nodes = [node for node in nodes]
- cluster_nodes = [self.__get_remote_login_settings(pool.id, node.id) for node in nodes]
+ cluster_nodes = [(node, self.__get_remote_login_settings(pool.id, node.id)) for node in nodes]
try:
    ssh_key = self.__create_user_on_pool('aztk', pool.id, nodes)
    asyncio.get_event_loop().run_until_complete(ssh_lib.clus_copy(container_name=container_name,
Contributor:

add

output = asyncio ...
return output

?
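In other words, something along these lines (a sketch only; the stub coroutine below stands in for the ssh_lib.clus_copy call from the diff, and the names are illustrative):

import asyncio

async def _clus_copy_stub():
    # stand-in for ssh_lib.clus_copy(...); returns per-node results
    return ["node-1: copied", "node-2: copied"]

def cluster_copy_sketch():
    # capture the result of run_until_complete instead of discarding it, then return it
    output = asyncio.get_event_loop().run_until_complete(_clus_copy_stub())
    return output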

remote_path = "/tmp/debug.zip"
output = spark_client.cluster_copy(cluster_id, remote_path, local_path, host=True, get=True)
# write run output to debug/ directory
with open(os.path.join(os.path.dirname(local_path), "debug-output.txt"), 'w', encoding="UTF-8") as f:
Contributor:

should this be 'w' or 'w+' (for overwrite)? Not sure what the right thing here would be unless the logs have timestamps on them.

Member Author:

Not sure what you mean by this. debug-output.txt is the output from the cluster_run command. It is mainly just there to see if the tool crashed or not.
The file is also only written once so I'm not sure why I would overwrite.

Contributor:

I was under the impression you could run the tool multiple times. If that happens, do we want to append or overwrite?
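For reference, the two behaviors being discussed look roughly like this (a sketch; local_path and output are placeholders standing in for the values used in the diff above):

import os

local_path = "debug/cluster.zip"   # placeholder
output = "cluster_run finished\n"  # placeholder

os.makedirs(os.path.dirname(local_path), exist_ok=True)
debug_output = os.path.join(os.path.dirname(local_path), "debug-output.txt")

# 'w' truncates the file, so each run overwrites the previous output
with open(debug_output, 'w', encoding="UTF-8") as f:
    f.write(output)

# 'a' appends, preserving output from earlier runs of the tool
with open(debug_output, 'a', encoding="UTF-8") as f:
    f.write(output)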

Diagnostic program that runs on each node in the cluster
This program must be run with sudo
"""
import io
Contributor:

nitpick: alpha sort

Member Author (@jafreck, Apr 5, 2018):

These are PEP8 sorted, and alpha sorted within the groups.

Contributor:

I have no idea why I felt these were not sorted... You're right.

else:
    cmd = '/bin/bash 2>&1 -c \'set -e; set -o pipefail; {0}; wait\''.format(command)
    stdin, stdout, stderr = client.exec_command(cmd, get_pty=True)
    # [print(line.decode('utf-8')) for line in stdout.read().splitlines()]
Contributor:

remove commented log?

- result = spark_client.cluster_run(args.cluster_id, args.command)
+ results = spark_client.cluster_run(args.cluster_id, args.command)
+ for result in results:
+     print("---------------------------")  # TODO: replace with nodename
Contributor:

todo?

except batch_error.BatchErrorException as e:
    raise error.AztkError(helpers.format_batch_exception(e))

- def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str):
+ def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str, host=False, get=False):
Member:

what does get do?

Member Author:

get means retrieve files from the nodes. Previously, cluster_copy was limited to copying a local file to all nodes on the cluster. Now it can work both ways.
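An illustrative usage sketch of both directions (spark_client is the client instance from the snippets above; the cluster id and paths are made up, and the reading of host=True as targeting the host VM rather than the Spark container is an assumption):

# original behavior: push a local file out to every node in the cluster
spark_client.cluster_copy("my-cluster", "conf/spark-defaults.conf", "/home/spark/spark-defaults.conf")

# new behavior: pull a file back from every node (get=True), reading from the host (host=True)
spark_client.cluster_copy("my-cluster", "/tmp/debug.zip", "debug/", host=True, get=True)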

Member:

Hmm, I find it a bit weird to have it this way.

Member Author:

What would you suggest? Is it just the name of the parameter that's confusing?

Contributor:

Not sure if having two methods would be more clear?

def copy_to_cluster(...):
    ...

def copy_from_cluster(...):
    ...

Thoughts?

Member Author:

A change like that would be breaking for any script using cluster_copy() today, unless we have cluster_copy() and copy_from_cluster(). I feel like that is not the best naming.

In general, I think we should consider an entire SDK rewrite to align cluster and job function names (didn't do a particularly good job naming them). It would also be nice to split the client so it has a cluster and a job submodule. So you would do client.cluster.get_log() or client.job.get_log().
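A rough sketch of what that split might look like (purely hypothetical structure, not the current aztk API):

class ClusterOperations:
    def get_log(self, cluster_id: str) -> str:
        # hypothetical: fetch logs for a cluster
        return "cluster log for {0}".format(cluster_id)

class JobOperations:
    def get_log(self, job_id: str) -> str:
        # hypothetical: fetch logs for a job
        return "job log for {0}".format(job_id)

class Client:
    def __init__(self):
        self.cluster = ClusterOperations()
        self.job = JobOperations()

client = Client()
client.cluster.get_log("my-cluster")
client.job.get_log("my-job")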

Member:

Yeah, I'm down to rename stuff. We can either keep the old functions as wrappers that call the new ones and mark them as deprecated, or just remove them, since we are technically only releasing this in the next version.

Member:

I think deprecating is probably better, with something like this: https://stackoverflow.com/questions/2536307/how-do-i-deprecate-python-functions

Member Author:

Deprecating is fine, I think, so long as we have a set time frame (maybe 1 or 2 versions) after which we actually remove the code.
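For reference, a minimal decorator along the lines of the Stack Overflow answer linked above (a sketch, not the code this PR ended up with; the replacement name is the hypothetical one from the discussion):

import functools
import warnings

def deprecated(replacement):
    """Mark a function as deprecated and point callers at its replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn("{0} is deprecated, use {1} instead".format(func.__name__, replacement),
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("copy_from_cluster")
def cluster_copy(*args, **kwargs):
    ...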

parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')

parser.add_argument('--output', '-o', required=True,
Member:

Couldn't we make that optional and default to something like aztk_debug/[cluster_id]?

Member Author:

Yeah, that's a good idea. By default, it can be debug-{cluster-id}/ in the working directory.
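A sketch of the optional argument with that default (the help text and exact default path are assumptions):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')
# --output is now optional; when omitted, fall back to debug-{cluster_id}/ in the working directory
parser.add_argument('--output', '-o', required=False,
                    help='The directory to write the debug output to')

args = parser.parse_args()
if args.output is None:
    args.output = os.path.join(os.getcwd(), "debug-{0}".format(args.cluster_id))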

pylintrc Outdated
@@ -9,7 +9,7 @@

# Add files or directories to the blacklist. They should be base names, not
# paths.
- ignore=CVS
+ ignore=CVS,debug.py
Member:

why did you remove this one?

Member Author:

No idea but it does not seem necessary at all.

Member Author:

Ah, now I remember:

[image]

The debug.py file has a dependency that is installed at runtime (the docker Python module). Travis doesn't have that dependency, so the build fails.

Member:

Maybe we should add this as a dev dependency then, or have Travis install it separately.

Member Author:

Switched to a single # pylint: disable=import-error on the import line and it seems to work fine.
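That inline suppression looks like this (the module name comes from the comment above; the exact line in debug.py may differ):

# docker is installed on the node at runtime, so it is not available to pylint in CI
import docker  # pylint: disable=import-error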

@jafreck merged commit 44a0765 into Azure:master on Apr 9, 2018