Conversation
…tk into feature/spark-diagnostic-tool
aztk/client.py
Outdated
pool, nodes = self.__get_pool_details(cluster_id)
nodes = [node for node in nodes]
cluster_nodes = [self.__get_remote_login_settings(pool.id, node.id) for node in nodes]
cluster_nodes = [(node, self.__get_remote_login_settings(pool.id, node.id)) for node in nodes]
try:
    ssh_key = self.__create_user_on_pool('aztk', pool.id, nodes)
    asyncio.get_event_loop().run_until_complete(ssh_lib.clus_copy(container_name=container_name,
add
output = asyncio ...
return output
?
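A minimal sketch of the reviewer's suggestion above: capture the result of the asyncio call and return it instead of discarding it. `clus_copy` here is a stand-in for the real `ssh_lib.clus_copy` coroutine, and `run_copy` for the surrounding client method; both names are hypothetical.

```python
import asyncio

async def clus_copy(container_name):
    # stand-in for the real ssh_lib.clus_copy coroutine in the diff above
    return ["{0}: copied".format(container_name)]

def run_copy(container_name):
    # mirror of the reviewer's suggestion: capture the result and return it.
    # new_event_loop() is used here instead of get_event_loop() so the sketch
    # runs cleanly on recent Python versions as well.
    loop = asyncio.new_event_loop()
    try:
        output = loop.run_until_complete(clus_copy(container_name=container_name))
    finally:
        loop.close()
    return output

print(run_copy("debug"))  # ['debug: copied']
```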
remote_path = "/tmp/debug.zip"
output = spark_client.cluster_copy(cluster_id, remote_path, local_path, host=True, get=True)
# write run output to debug/ directory
with open(os.path.join(os.path.dirname(local_path), "debug-output.txt"), 'w', encoding="UTF-8") as f:
should this be 'w' or 'w+' (for overwrite)? Not sure what the right thing here would be unless the logs have timestamps on them.
Not sure what you mean by this. debug-output.txt is the output from the cluster_run command. It is mainly just there to see if the tool crashed or not.
The file is also only written once so I'm not sure why I would overwrite.
I was under the impression you could run the tool multiple times. If that happens, do we want to append or overwrite?
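To ground the modes being discussed: `'w'` (and `'w+'`) truncate the file on every run, while `'a'` appends, so re-running the tool with `'w'` silently discards the previous run's output. A quick illustration, using a temporary file rather than the real debug path:

```python
import os
import tempfile

# hypothetical stand-in for the real debug-output.txt path
path = os.path.join(tempfile.mkdtemp(), "debug-output.txt")

with open(path, 'w', encoding="UTF-8") as f:
    f.write("run 1\n")
with open(path, 'w', encoding="UTF-8") as f:   # 'w' truncates: run 1 is lost
    f.write("run 2\n")
with open(path, 'a', encoding="UTF-8") as f:   # 'a' appends instead
    f.write("run 3\n")

with open(path, encoding="UTF-8") as f:
    print(f.read())  # "run 2" then "run 3"; "run 1" is gone
```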
Diagnostic program that runs on each node in the cluster
This program must be run with sudo
"""
import io
nitpick: alpha sort
These are PEP8 sorted, and alpha sorted within the groups.
I have no idea why I felt these were not sorted... You're right.
aztk/utils/ssh.py
Outdated
else:
    cmd = '/bin/bash 2>&1 -c \'set -e; set -o pipefail; {0}; wait\''.format(command)
    stdin, stdout, stderr = client.exec_command(cmd, get_pty=True)
    # [print(line.decode('utf-8')) for line in stdout.read().splitlines()]
remove commented log?
result = spark_client.cluster_run(args.cluster_id, args.command)
results = spark_client.cluster_run(args.cluster_id, args.command)
for result in results:
    print("---------------------------") #TODO: replace with nodename
todo?
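One way to resolve the TODO above would be to print a per-node header. This sketch assumes `cluster_run` returns `(node_id, output)` pairs; that tuple shape is an assumption for illustration, not aztk's actual return type.

```python
def format_results(results):
    # results: assumed list of (node_id, output) pairs
    lines = []
    for node_id, output in results:
        lines.append("--------- {0} ---------".format(node_id))
        lines.append(output)
    return "\n".join(lines)

print(format_results([("node-1", "ok"), ("node-2", "ok")]))
```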
aztk/spark/client.py
Outdated
except batch_error.BatchErrorException as e:
    raise error.AztkError(helpers.format_batch_exception(e))

def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str):
def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str, host=False, get=False):
what does `get` do?
`get` means retrieve files from the nodes. Previously, cluster_copy was limited to copying a local file to all nodes in the cluster. Now it can work both ways.
hmm, I find it a bit weird to have it work that way
what would you suggest? is it just that the name of the parameter is confusing?
Not sure if having two methods would be more clear?
def copy_to_cluster(...):
    ...
def copy_from_cluster(...):
    ...
Thoughts?
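A hypothetical sketch of the two-method split proposed above, with `cluster_copy` kept as a thin backwards-compatible wrapper. The `SparkClient` class and `_copy` helper here are illustrative stand-ins, not aztk's real internals.

```python
class SparkClient:
    def _copy(self, cluster_id, source, dest, get):
        # stand-in for the real SSH copy machinery; returns a summary string
        direction = "from" if get else "to"
        return "copy {0} {1}: {2} -> {3}".format(direction, cluster_id, source, dest)

    def copy_to_cluster(self, cluster_id, source_path, destination_path):
        return self._copy(cluster_id, source_path, destination_path, get=False)

    def copy_from_cluster(self, cluster_id, source_path, destination_path):
        return self._copy(cluster_id, source_path, destination_path, get=True)

    def cluster_copy(self, cluster_id, source_path, destination_path, get=False):
        # backwards-compatible entry point dispatching on the get flag
        if get:
            return self.copy_from_cluster(cluster_id, source_path, destination_path)
        return self.copy_to_cluster(cluster_id, source_path, destination_path)
```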
A change like that would be breaking for any script using cluster_copy() today, unless we have cluster_copy() and copy_from_cluster(). I feel like that is not the best naming.
In general, I think we should consider an entire SDK rewrite to align cluster and job function names (didn't do a particularly good job naming them). It would also be nice to split the client so it has a cluster and a job submodule. So you would do client.cluster.get_log() or client.job.get_log().
yeah, I'm down to rename stuff. We can either keep the old functions as deprecated wrappers that call the new ones, or just remove them, since we are technically only releasing this next version.
I think deprecating is probably better, with something like this: https://stackoverflow.com/questions/2536307/how-do-i-deprecate-python-functions
Deprecating is fine, I think, so long as we have a set time frame (maybe 1 or 2 versions) after which we actually remove the code.
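A minimal deprecation decorator along the lines of the linked Stack Overflow discussion: the old name keeps working but emits a `DeprecationWarning` pointing at its replacement. The `cluster_copy` body here is a placeholder.

```python
import functools
import warnings

def deprecated(replacement):
    # decorator factory: warns on each call, then delegates to the old body
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                "{0} is deprecated, use {1} instead".format(func.__name__, replacement),
                DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("copy_to_cluster")
def cluster_copy(cluster_id, source, dest):
    # placeholder body; the real method would perform the copy
    return (cluster_id, source, dest)
```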
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')

parser.add_argument('--output', '-o', required=True,
couldn't we make that optional, to be something like aztk_debug/[cluster_id] by default?
yeah, that's a good idea. by default, it can be debug-{cluster-id}/ in the working directory.
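A sketch of the agreed default: make `--output` optional and compute the per-cluster default after parsing, since it depends on `--id`. The hardcoded argv is only for demonstration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')
parser.add_argument('--output', '-o', required=False, default=None,
                    help='Directory to write the debug output to')

# simulated command line for illustration
args = parser.parse_args(['--id', 'my-cluster'])
if args.output is None:
    # default depends on the parsed cluster id, so it is filled in afterwards
    args.output = "debug-{0}/".format(args.cluster_id)

print(args.output)  # debug-my-cluster/
```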
pylintrc
Outdated
@@ -9,7 +9,7 @@

# Add files or directories to the blacklist. They should be base names, not
# paths.
ignore=CVS
ignore=CVS,debug.py
why did you remove this one?
No idea but it does not seem necessary at all.
maybe we should add this as a dev dependency then, or have Travis install it separately
switched to a single `# pylint: disable=import-error` on the import line and it seems to work fine.
…tk into feature/spark-diagnostic-tool
Fix #347