
Add mlflow logging #326

Merged
merged 11 commits into develop on Sep 16, 2024
Conversation

@smithcommajoseph (Collaborator) commented Aug 7, 2024

Adds support for logging model training data to an MLFlow instance.

Note: Alpha and Bravo models are well supported. Charlie and Alpha Ensemble models still require some work as of this writing. @haneslinger, I'd appreciate feedback re: how to add better logging for the above.
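For reviewers less familiar with MLflow tracking, the snippet below is a minimal sketch of the kind of logging this PR adds; the tracking URI, experiment name, metric names, and artifact path are placeholders rather than the actual Wattile identifiers:

```python
import mlflow

# Placeholder tracking server and experiment; the real values would come from Wattile's config.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("wattile-training")

with mlflow.start_run(run_name="alpha-model"):
    # Params and tags make runs searchable and sortable in the MLflow UI.
    mlflow.log_params({"hidden_size": 64, "epochs": 3})
    mlflow.set_tags({"model_type": "alpha"})

    # Per-epoch metrics (dummy losses here) become time series attached to the run.
    for epoch, loss in enumerate([1.0, 0.7, 0.5]):
        mlflow.log_metric("train_loss", loss, step=epoch)

    # Artifacts already written to the local filesystem can be uploaded to the run
    # (the path below must exist locally for this call to succeed).
    mlflow.log_artifact("exp_dir/Training_history.csv")
```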

@haneslinger (Collaborator)

Could/should this replace Training_history.csv and save_model/load_model?

@smithcommajoseph (Collaborator, Author)

> Could/should this replace Training_history.csv and save_model/load_model?

I've also wondered about where the boundaries are for our dependency on MLflow.

Currently we take all artifacts saved to the local filesystem and upload them (including the model) to MLFlow. We could reduce the generated artifacts (error_stats_train.json and Training_history.csv) and let MLFlow handle logging that data for us. Similar thoughts apply to the save/load model functions.

On the other hand, I also see an argument for keeping MLFlow use optional rather than compulsory, which is probably where I'm leaning...
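As a hedged illustration of what "letting MLFlow handle logging that data" could look like (the variable names and values below are made up, and this is not the pattern currently in the PR):

```python
import mlflow

# Hypothetical stand-ins for data Wattile currently writes to the local filesystem.
error_stats = {"rmse": 0.12, "mae": 0.09}
training_history = [{"loss": 1.0}, {"loss": 0.7}, {"loss": 0.5}]

with mlflow.start_run():
    # Instead of writing error_stats_train.json locally and re-uploading it,
    # the stats dict can be logged straight to the run as a JSON artifact.
    mlflow.log_dict(error_stats, "error_stats_train.json")

    # Instead of appending rows to Training_history.csv, per-epoch metrics can be
    # logged as time series that the MLflow UI already knows how to plot.
    for epoch, row in enumerate(training_history):
        mlflow.log_metric("train_loss", row["loss"], step=epoch)
```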

@stephen-frank (Member)

How big is MLFlow, i.e., how heavy a dependency is it if we were to make it mandatory? I'm thinking about the compactness of deploying Wattile via Docker. If it is not a heavy dependency, I don't necessarily have a problem with it being required if it makes everything simpler.

@smithcommajoseph (Collaborator, Author) commented Aug 14, 2024

TLDR:

MLFlow support adds value, but making MLFlow mandatory adds complexity and could have adverse side effects.


Full version:

If we make MLFlow mandatory, Wattile will become more brittle and less tolerant of network outages. While the MLFlow SDK has a concept of resilient sessions, where it retries unresponsive endpoints n times over a given period, it will still throw errors if the server is not found or remains unresponsive. If that happens when it is time to log model metrics or artifacts, an operator may have to repeat the training, which has fiscal and temporal impacts (I'm all about saving time and money). Wattile's current ability to log artifacts to the file system without a network dependency (aside from the initial package downloads and the data required to train the model(s)) makes it resilient to intermittent network outages. Assuming all goes well during the training phase, this pattern ensures that Wattile presents an operator with a fully functional model and its artifacts. The implementation proposed in this PR attempts to log metrics and wholesale upload all artifacts to MLFlow for storage.
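One way to bound that risk (a sketch under my own assumptions, not necessarily what this PR does) is to treat registry logging as best-effort so a registry outage never throws away a finished training run:

```python
import logging

import mlflow
from mlflow.exceptions import MlflowException
from requests.exceptions import ConnectionError, Timeout

logger = logging.getLogger(__name__)


def log_run_to_registry(metrics, artifact_dir):
    """Best-effort upload of a finished run; local artifacts remain the source of truth."""
    try:
        with mlflow.start_run():
            for name, value in metrics.items():
                mlflow.log_metric(name, value)
            mlflow.log_artifacts(artifact_dir)
    except (MlflowException, ConnectionError, Timeout) as exc:
        # Training already succeeded; keep the local files and just warn about the registry.
        logger.warning("MLflow logging failed; artifacts remain on disk only: %s", exc)
```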

Additionally, MLFlow represents one more server that an organization will need to set up and maintain. Indeed, one can run MLFlow locally for smaller jobs, but in practice, folks will likely want to use Databricks, AWS, or in-house hardware to run this server, which feels like a burdensome dependency to add. The implementation proposed in this PR makes MLFlow 100% optional. It adds value and provides operators/users a way to track, rank, and store models, which is crucial to scalable ML workflows. But if folks want to bite only some of that off when they start experimenting with Wattile, then no worries, as all core functionality is available with or without MLFlow.
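To sketch what "optional" can look like at the config level (the key names below are hypothetical placeholders, not the actual props this PR adds to configs.json):

```python
import mlflow


def maybe_start_registry_run(configs):
    """Open an MLflow run only when the registry is enabled in the user's config."""
    registry_cfg = configs.get("model_registry", {})  # hypothetical config section
    if not registry_cfg.get("enabled", False):
        return None  # core training and artifact writing proceed without MLflow
    mlflow.set_tracking_uri(registry_cfg["tracking_uri"])
    mlflow.set_experiment(registry_cfg.get("experiment", "wattile"))
    return mlflow.start_run()
```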

@stephen-frank (Member) left a comment


I'm not going to pretend I understand all these changes enough to be a useful reviewer, but at least I didn't see any red flags!

@smithcommajoseph merged commit af310a1 into develop on Sep 16, 2024 (2 checks passed)
@smithcommajoseph deleted the add-mlflow-logging branch on September 16, 2024 at 18:52
@haneslinger pushed a commit that referenced this pull request on Oct 9, 2024
* Updates python to 3.12

Bumps all packages/deps
Removes pylama (no longer maintained)

* Adds model_registry class
First step towards model logging in alpha models

* Adds model logging and snapshotting of metrics to alpha and bravo models
Add model logging to charlie

* Adds docs for model registry related config

* Updates poetry.lock

* Cleans up model registry methods
Updates comments to the above

* Fixes formatting errors in alpha_model

* Fixes isort formatting error in base_model

* Improves registry tagging defaults for searching/sorting results
Adds support for Alpha ensemble registry logging/tagging
Adds default registry conf props

* Fixes version solve issues w/ MLFlow and numpy
Removes registry from Charlie models
Adds registry object to top-level alpha ensemble models

* Updates base configs.json
Adds regenerated poetry.lock file