Add mlflow logging #326
Conversation
Could/should this replace …
I've also wondered about where the boundaries are for our dependency on MLflow. Currently we take all artifacts saved to the local filesystem and upload them (including the model) to MLflow. We could reduce generated artifacts ( On the other hand, I ALSO see an argument for letting MLflow use be optional, not compulsory. Which is probably where I'm leaning...
How big is MLFlow, as in, how heavy of a dependency is it if we were to make it mandatory? Thinking about compactness of deploying Wattile via Docker. If it is not a heavy dependency, I don't necessarily have a problem with it being required if it makes everything simpler. |
TLDR: MLflow support adds value, but making MLflow mandatory adds complexity and could have adverse side effects.

Full version: If we make MLflow mandatory, Wattile will become more brittle and less tolerant of network outages. While the MLflow SDK has a concept of resilient sessions, where it retries unresponsive endpoints n times over a given period, it will throw errors if the server is not found or unresponsive. If this occurs when it is time to log model metrics or artifacts, an operator may have to repeat the training, which has fiscal and temporal impacts (I'm all about saving time and money).

Wattile's current ability to log artifacts to the file system without network dependency, except for the initial package downloads and the data required to train the model(s), makes it resilient to intermittent network outages. Assuming all goes well during the training phase, this pattern ensures that Wattile presents an operator with a fully functional model and its artifacts. The implementation proposed in this PR attempts to log metrics and wholesale upload all artifacts to MLflow for storage.

Additionally, MLflow represents one more server that an organization will need to set up and maintain. One can run MLflow locally for smaller jobs, but in practice folks will likely want to use Databricks, AWS, or in-house hardware to run this server, which feels like a burdensome dependency to add.

The implementation proposed in this PR makes MLflow 100% optional. It adds value and provides operators/users a way to track, rank, and store models, which is crucial to scalable ML workflows. But if folks want to bite off only some of that when they start experimenting with Wattile, then no worries, as all core functionality is available with or without MLflow.
I'm not going to pretend I understand all these changes enough to be a useful reviewer, but at least I didn't see any red flags!
Force-pushed the branch from d3cab6b to 556cadd.
- Bumps all packages/deps; removes pylama (no longer maintained)
- First step towards model logging in alpha models
- Add model logging to charlie
- Updates comments to the above
- Adds support for Alpha ensemble registry logging/tagging; adds default registry conf props
- Removes registry from Charlie models; adds registry object to top-level alpha ensemble models
- Adds regenerated poetry.lock file
Force-pushed the branch from 556cadd to 4e946d1.
* Updates python to 3.12; bumps all packages/deps; removes pylama (no longer maintained)
* Adds model_registry class; first step towards model logging in alpha models
* Adds model logging and snapshotting of metrics to alpha and bravo models; adds model logging to charlie
* Adds docs for model registry related config
* Updates poetry.lock
* Cleans up model registry methods; updates comments to the above
* Fixes formatting errors in alpha_model
* Fixes isort formatting error in base_model
* Improves registry tagging defaults for searching/sorting results; adds support for Alpha ensemble registry logging/tagging; adds default registry conf props
* Fixes version solve issues w/ MLFlow and numpy; removes registry from Charlie models; adds registry object to top-level alpha ensemble models
* Updates base configs.json; adds regenerated poetry.lock file
Adds support for logging model training data to an MLflow instance.

Note: Alpha and Bravo models are well supported; Charlie and Alpha Ensemble models still require some work as of this writing. @haneslinger, I'd appreciate feedback on how to add better logging for the above.
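Since the PR adds "default registry conf props" to the base configs.json and makes MLflow optional, an opt-in registry section might look something like the following. This is only an illustrative sketch: the key names (`registry`, `enabled`, `tracking_uri`, `experiment_name`) are hypothetical and are not taken from Wattile's actual config schema.

```json
{
  "registry": {
    "enabled": false,
    "tracking_uri": "http://localhost:5000",
    "experiment_name": "wattile-alpha"
  }
}
```

With a shape like this, leaving `enabled` false (or omitting the section entirely) would keep Wattile's current local-filesystem-only behavior.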