Skip to content
This repository has been archived by the owner on May 15, 2020. It is now read-only.

Monitor SageMaker updates #113

Merged
merged 43 commits into from Jan 11, 2019
Merged

Monitor SageMaker updates #113

merged 43 commits into from Jan 11, 2019

Conversation

ksaaskil
Copy link
Contributor

@ksaaskil ksaaskil commented Jan 10, 2019

Sorry for the huge PR, it took quite some effort to get stuff working.

  • Fetch job metrics using sagemaker.analytics
  • Compute the difference of records in previous and new metrics, assuming that sagemaker.analytics appends new rows to the end of DataFrame once new values emerge. This should be checked
  • Add new scalars to the job via job.add_scalar
  • Notify updates via the same notify method as for scheduler jobs
  • Update example notebook to use a cheaper instance (takes three hours to train now), add some new metrics to be outputted by the script (learning rate and training loss)
  • Copy-paste code from Scheduler for querying and posting job metrics, these should probably go to TrackerBase or JobMonitorBase to avoid duplicating code
  • Use threading.Lock in initializing SageMakerHelper connections to avoid race conditions
  • Minimum polling interval of 60 secs in SageMakerJobMonitor. It only notifies when new values are available, though.
  • Change meeshkan stop to not raise an exception when used before starting the agent
  • Create terminate_daemon event in the daemon process, this should be more robust than setting it before spawning and forking (?)
  • Something's broken in Azure tests, I could also reproduce this locally at some point... EDIT: Added longer waiting times and changed maxParallel to one. Also changed terminate_daemon to be created in the API process. The process can only be stopped via the Pyro proxy anyway in real usage, as only tests re-use the Service instance for start -> stop etc.

@ksaaskil ksaaskil changed the title WIP Monitor sagemaker updates WIP Monitor SageMaker updates Jan 10, 2019
@ksaaskil ksaaskil changed the title WIP Monitor SageMaker updates Monitor SageMaker updates Jan 11, 2019
@ksaaskil ksaaskil merged commit b569a4e into dev Jan 11, 2019
@ksaaskil ksaaskil deleted the monitor-sagemaker-updates branch January 11, 2019 15:07
Copy link
Contributor

@idantene idantene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

@@ -197,6 +197,7 @@ def get_updates(self, job_id: uuid.UUID) -> HistoryByScalar:
@Pyro4.expose
def stop(self):
if self.__was_shutdown:
LOGGER.warning("Agent API was shutdown twice.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this happen?

@@ -91,8 +106,8 @@ def get_job_status(self, job_name) -> JobStatus:
:raises JobNotFoundException: If job was not found.
:return: Job status
"""

self.check_or_build_connection()
with self.lock:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a thread lock? Are there any race conditions?

LOGGER.exception("Polling for updates failed")
# Ignore

# TODO Remove duplicate code with Scheduler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goes back to perhaps having a common base class with Scheduler 🤔

@@ -66,6 +67,8 @@ def daemonize(self, serialized):
pid = os.fork()
if pid > 0: # Close parent process
return
if not self.terminate_daemon:
self.terminate_daemon = multiprocessing.get_context("spawn").Event()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiprocessing.get_context("spawn") should also be abstracted away somehow so we don't duplicate it everywhere

@ksaaskil ksaaskil mentioned this pull request Jan 15, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants