
Add mlflow logging #326

Merged
merged 11 commits into develop on Sep 16, 2024
Conversation

@smithcommajoseph (Collaborator) commented Aug 7, 2024

Adds support for logging model training data to an MLFlow instance.

Note: Alpha and Bravo models are well supported. Charlie and Alpha Ensemble models still require some work as of this writing. @haneslinger, I'd appreciate feedback re: how to add better logging for the above.
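For reviewers less familiar with MLflow tracking, the snippet below is a minimal sketch of the kind of logging this PR adds; the tracking URI, experiment name, metric names, and artifact path are placeholders rather than the actual Wattile identifiers:

```python
import mlflow

# Placeholder tracking server and experiment; the real values would come from Wattile's config.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("wattile-training")

with mlflow.start_run(run_name="alpha-model"):
    # Params and tags make runs searchable and sortable in the MLflow UI.
    mlflow.log_params({"hidden_size": 64, "epochs": 3})
    mlflow.set_tags({"model_type": "alpha"})

    # Per-epoch metrics (dummy losses here) become time series attached to the run.
    for epoch, loss in enumerate([1.0, 0.7, 0.5]):
        mlflow.log_metric("train_loss", loss, step=epoch)

    # Artifacts already written to the local filesystem can be uploaded to the run
    # (the path below must exist locally for this call to succeed).
    mlflow.log_artifact("exp_dir/Training_history.csv")
```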

@haneslinger (Collaborator)

Could/should this replace Training_history.csv and save_model/load_model?

@smithcommajoseph (Collaborator, Author)

> Could/should this replace Training_history.csv and save_model/load_model?

I've also wondered about where the boundaries are for our dependency on MLflow.

Currently we take all artifacts saved to the local filesystem and upload them (including the model) to MLFlow. We could reduce the generated artifacts (error_stats_train.json and Training_history.csv) and let MLFlow handle logging that data for us. Similar thoughts apply to the save/load model functions.

On the other hand, I also see an argument for keeping MLFlow use optional rather than compulsory, which is probably where I'm leaning...
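As a hedged illustration of what "letting MLFlow handle logging that data" could look like (the variable names and values below are made up, and this is not the pattern currently in the PR):

```python
import mlflow

# Hypothetical stand-ins for data Wattile currently writes to the local filesystem.
error_stats = {"rmse": 0.12, "mae": 0.09}
training_history = [{"loss": 1.0}, {"loss": 0.7}, {"loss": 0.5}]

with mlflow.start_run():
    # Instead of writing error_stats_train.json locally and re-uploading it,
    # the stats dict can be logged straight to the run as a JSON artifact.
    mlflow.log_dict(error_stats, "error_stats_train.json")

    # Instead of appending rows to Training_history.csv, per-epoch metrics can be
    # logged as time series that the MLflow UI already knows how to plot.
    for epoch, row in enumerate(training_history):
        mlflow.log_metric("train_loss", row["loss"], step=epoch)
```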

@stephen-frank (Member)

How big is MLFlow, i.e., how heavy a dependency is it if we were to make it mandatory? I'm thinking about the compactness of deploying Wattile via Docker. If it is not a heavy dependency, I don't necessarily have a problem with it being required if it makes everything simpler.

@smithcommajoseph (Collaborator, Author) commented Aug 14, 2024

TLDR:

MLFlow support adds value, but making MLFlow mandatory adds complexity and could have adverse side effects.


Full version:

If we make MLFlow mandatory, Wattile will become more brittle and less tolerant of network outages. While the MLFlow SDK has a concept of resilient sessions, where it retries unresponsive endpoints n times over a given period, it will still throw errors if the server is not found or remains unresponsive. If that happens when it is time to log model metrics or artifacts, an operator may have to repeat the training, which has fiscal and temporal impacts (I'm all about saving time and money). Wattile's current ability to log artifacts to the file system without a network dependency (aside from the initial package downloads and the data required to train the model(s)) makes it resilient to intermittent network outages. Assuming all goes well during the training phase, this pattern ensures that Wattile presents an operator with a fully functional model and its artifacts. The implementation proposed in this PR attempts to log metrics and wholesale upload all artifacts to MLFlow for storage.
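One way to bound that risk (a sketch under my own assumptions, not necessarily what this PR does) is to treat registry logging as best-effort so a registry outage never throws away a finished training run:

```python
import logging

import mlflow
from mlflow.exceptions import MlflowException
from requests.exceptions import ConnectionError, Timeout

logger = logging.getLogger(__name__)


def log_run_to_registry(metrics, artifact_dir):
    """Best-effort upload of a finished run; local artifacts remain the source of truth."""
    try:
        with mlflow.start_run():
            for name, value in metrics.items():
                mlflow.log_metric(name, value)
            mlflow.log_artifacts(artifact_dir)
    except (MlflowException, ConnectionError, Timeout) as exc:
        # Training already succeeded; keep the local files and just warn about the registry.
        logger.warning("MLflow logging failed; artifacts remain on disk only: %s", exc)
```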

Additionally, MLFlow represents one more server that an organization will need to set up and maintain. Indeed, one can run MLFlow locally for smaller jobs, but in practice, folks will likely want to use Databricks, AWS, or in-house hardware to run this server, which feels like a burdensome dependency to add. The implementation proposed in this PR makes MLFlow 100% optional. It adds value and provides operators/users a way to track, rank, and store models, which is crucial to scalable ML workflows. But if folks want to bite only some of that off when they start experimenting with Wattile, then no worries, as all core functionality is available with or without MLFlow.
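To sketch what "optional" can look like at the config level (the key names below are hypothetical placeholders, not the actual props this PR adds to configs.json):

```python
import mlflow


def maybe_start_registry_run(configs):
    """Open an MLflow run only when the registry is enabled in the user's config."""
    registry_cfg = configs.get("model_registry", {})  # hypothetical config section
    if not registry_cfg.get("enabled", False):
        return None  # core training and artifact writing proceed without MLflow
    mlflow.set_tracking_uri(registry_cfg["tracking_uri"])
    mlflow.set_experiment(registry_cfg.get("experiment", "wattile"))
    return mlflow.start_run()
```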

@stephen-frank (Member) left a comment


I'm not going to pretend I understand all these changes enough to be a useful reviewer, but at least I didn't see any red flags!

@smithcommajoseph merged commit af310a1 into develop on Sep 16, 2024 (2 checks passed)
@smithcommajoseph deleted the add-mlflow-logging branch on September 16, 2024 at 18:52
@haneslinger pushed a commit that referenced this pull request on Oct 9, 2024
* Updates python to 3.12

Bumps all packages/deps
Removes pylama (no longer maintained)

* Adds model_registry class
First step towards model logging in alpha models

* Adds model logging and snapshotting of metrics to alpha and bravo models
Add model logging to charlie

* Adds docs for model registry related config

* Updates poetry.lock

* Cleans up model registry methods
Updates comments to the above

* Fixes formatting errors in alpha_model

* Fixes isort formatting error in base_model

* Improves registry tagging defaults for searching/sorting results
Adds support for Alpha ensemble registry logging/tagging
Adds default registry conf props

* Fixes version solve issues w/ MLFlow and numpy
Removes registry from Charlie models
Adds registry object to top-level alpha ensemble models

* Updates base configs.json
Adds regenerated poetry.lock file