Lin 673 lineapy.get for MLflow #829
Merged
14 commits
b4fcf3c: LIN-674, LIN-671 add mlflow configs (#825) (mingjerli)
8ef41a4: LIN-672 lineapy.save for mlflow (#828) (mingjerli)
d4d8bba: WIP-lineapy-get-metadata (mingjerli)
c443b09: WIP - Implement Artifact.get_value and Artifact.get_metadata (mingjerli)
88c2e5f: Implement delete for MLflow (mingjerli)
509e5e8: Add statsmodels and xgboost serializer/deserializer for MLflow (mingjerli)
0d3e9f8: Add doc (mingjerli)
8cb1a02: Add RTD for MLflow (mingjerli)
0743102: Update docs to address PR review (mingjerli)
4d9718c: Address PR feedback (mingjerli)
7188ee9: Change mlflow deletion db logic (mingjerli)
8ac717b: rebase (mingjerli)
62770eb: refactor common code for different storage backend saving logic (mingjerli)
1c5ace1: Add doc for backend storage (mingjerli)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -179,3 +179,6 @@ Untitled*.ipynb | |
tests/outputs | ||
*.pickle | ||
.linea/linea_pickles | ||
|
||
# mlflow | ||
mlruns/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
lineapy.plugins.serializers package | ||
=================================== | ||
|
||
Submodules | ||
---------- | ||
|
||
lineapy.plugins.serializers.mlflow\_io module | ||
--------------------------------------------- | ||
|
||
.. automodule:: lineapy.plugins.serializers.mlflow_io | ||
:members: | ||
|
||
Module contents | ||
--------------- | ||
|
||
.. automodule:: lineapy.plugins.serializers | ||
:members: |
14 changes: 14 additions & 0 deletions
docs/source/guide/manage_artifacts/storage_backend/index.rst
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Changing Storage Backend | ||
======================== | ||
|
||
Out of the box, LineaPy is the default storage backend for all artifacts. | ||
However, when artifacts are already saved in an existing storage system (MLflow, a database, ...), saving one more copy in LineaPy can cause syncing issues between the two systems. | ||
Thus, LineaPy supports using different storage backends for some data types. | ||
This support is essential for users to leverage functionalities from both LineaPy and their familiar tools. | ||
|
||
Currently, LineaPy supports MLflow as a storage backend for ML models. | ||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
mlflow |
110 changes: 110 additions & 0 deletions
docs/source/guide/manage_artifacts/storage_backend/mlflow.rst
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
.. _mlflow: | ||
|
||
Using MLflow as Storage Backend to Save ML Models | ||
================================================= | ||
|
||
.. include:: ../../../snippets/slack_support.rstinc | ||
|
||
By default, LineaPy uses its own storage to save artifacts of all object types. | ||
However, for users who have access to MLflow, MLflow might be the first choice for saving ML models. | ||
Thus, we enable using MLflow as the backend storage for ML models. | ||
|
||
Configure MLflow | ||
---------------- | ||
|
||
Depending on how MLflow is configured, we might need to specify the ``tracking URI`` and (optionally) the ``registry URI`` in MLflow before we can start using it. | ||
|
||
.. code:: python | ||
|
||
mlflow.set_tracking_uri('your_mlflow_tracking_uri') | ||
mlflow.set_registry_uri('your_mlflow_registry_uri') | ||
|
||
To make LineaPy aware of MLflow, we need to set the corresponding config items if we want to use MLflow as the storage backend for ML models. | ||
|
||
.. code:: python | ||
|
||
lineapy.options.set('mlflow_tracking_uri','your_mlflow_tracking_uri') | ||
lineapy.options.set('mlflow_registry_uri','your_mlflow_registry_uri') | ||
|
||
|
||
.. note:: | ||
|
||
For objects not supported by MLflow, LineaPy falls back to using itself as the storage backend, as usual. | ||
|
||
Set Default Storage Backend for ML Models | ||
----------------------------------------- | ||
|
||
Each user might have a different usage pattern for MLflow: some use it for logging and record every model under development, while others treat it as a public space and only publish models that meet specific criteria. | ||
In the first case, users want to save artifacts (ML models) to MLflow by default; in the second case, users only want to save artifacts to MLflow when they explicitly ask for it. | ||
Thus, we provide an option (``default_ml_models_storage_backend``) that lets users decide the default storage backend for ML models when ``mlflow_tracking_uri`` has been set. | ||
|
||
The storage backend used for ML models is determined as follows: | ||
|
||
* Only set ``mlflow_tracking_uri`` but not ``default_ml_models_storage_backend`` | ||
|
||
.. code:: python | ||
|
||
lineapy.options.set("mlflow_tracking_uri", "databricks") | ||
|
||
lineapy.save(model, 'model') # Use MLflow (if mlflow_tracking_uri is set, default value of default_ml_models_storage_backend is mlflow ) | ||
lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow | ||
lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy | ||
|
||
|
||
* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='mlflow'`` | ||
|
||
.. code:: python | ||
|
||
lineapy.options.set("mlflow_tracking_uri", "databricks") | ||
lineapy.options.set("default_ml_models_storage_backend", "mlflow") | ||
lineapy.save(model, 'model') # Use MLflow | ||
lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow | ||
lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy | ||
|
||
|
||
* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='lineapy'`` | ||
|
||
.. code:: python | ||
|
||
lineapy.options.set("mlflow_tracking_uri", "databricks") | ||
lineapy.options.set("default_ml_models_storage_backend", "lineapy") | ||
lineapy.save(model, 'model') # Use LineaPy | ||
lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow | ||
lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy | ||
|
||
Note that when MLflow is the storage backend, ``lineapy.save`` wraps ``mlflow.<flavor>.log_model`` under the hood (e.g., ``mlflow.sklearn.log_model``). | ||
Users can pass any argument of ``mlflow.<flavor>.log_model`` to ``lineapy.save`` as well. | ||
For instance, if we want to specify ``registered_model_name``, we can write the save statement as: | ||
|
||
.. code:: python | ||
|
||
lineapy.save(model, name="model", storage_backend="mlflow", registered_model_name="clf") | ||
|
||
Retrieve Artifact from Both LineaPy and MLflow | ||
---------------------------------------------- | ||
|
||
Depending on what users want to do (or which tool they are familiar with), | ||
they can retrieve the same artifact (ML model) through either the LineaPy API or the MLflow API once ``lineapy.save`` has been executed with ``mlflow`` as the storage backend. | ||
|
||
* Retrieve artifact(model) with LineaPy API | ||
|
||
.. code:: python | ||
|
||
artifact = lineapy.get('model') | ||
lineapy_model = artifact.get_value() | ||
|
||
* Retrieve artifact (model) with MLflow API | ||
|
||
.. code:: python | ||
|
||
client = mlflow.MlflowClient() | ||
latest_version = client.search_model_versions("name='clf'")[0].version | ||
# This loads the same model as `lineapy_model` in the previous snippet | ||
mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}') | ||
|
||
Which MLflow Model Flavor is Supported | ||
-------------------------------------- | ||
|
||
Currently, we support the following flavors: ``sklearn``, ``xgboost``, ``prophet`` and ``statsmodels``. | ||
We plan to support all MLflow-supported model flavors soon. | ||
|
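The fallback described in mlflow.rst (use MLflow for supported flavors, otherwise use LineaPy) can be illustrated with a small sketch. The function names and the detection rule (matching the top-level package of the object's class) are assumptions made for illustration only; they are not taken from this PR's implementation.

```python
from typing import Optional

# Flavors listed in the docs above. Mapping a flavor to the top-level package
# of the object's class is an assumption, not LineaPy's actual detection logic.
SUPPORTED_FLAVORS = ("sklearn", "xgboost", "prophet", "statsmodels")


def guess_flavor(obj: object) -> Optional[str]:
    """Return the supported flavor whose package defines obj's class, else None."""
    root_module = type(obj).__module__.split(".")[0]
    return root_module if root_module in SUPPORTED_FLAVORS else None


def choose_backend(obj: object) -> str:
    """Pick MLflow for recognized model types, otherwise fall back to LineaPy."""
    return "mlflow" if guess_flavor(obj) is not None else "lineapy"
```

With this sketch, an object whose class lives under the `sklearn` package would be routed to MLflow, while a plain `dict` or `list` would be routed to LineaPy.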
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,23 +11,37 @@ These items are determined by the following order: | |
- Configuration file | ||
- Default values | ||
|
||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| name | usage | type | default | environmental variables | | ||
+=====================================+===============================+=========+============================================+=================================================+ | ||
| home_dir | LineaPy base folder | Path | `$HOME/.lineapy` | `LINEAPY_HOME_DIR` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| artifact_storage_dir | artifact saving folder | Path | `$LINEAPY_HOME_DIR/linea_pickles` | `LINEAPY_ARTIFACT_STORAGE_DIR` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| database_url | LineaPy db connection string | string | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite` | `LINEAPY_DATABASE_URL` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| customized_annotation_folder | user annotations folder | Path | `$LINEAPY_HOME_DIR/customized_annotations` | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| do_not_track | disable user analytics | boolean | false | `LINEAPY_DO_NOT_TRACK` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| logging_level | logging level | string | INFO | `LINEAPY_LOGGING_LEVEL` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| logging_file | logging file path | Path | `$LINEAPY_HOME_DIR/lineapy.log` | `LINEAPY_LOGGING_FILE` | | ||
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
* Core LineaPy configuration items | ||
|
||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| name | usage | type | default | environmental variables | | ||
+=====================================+=======================================+=========+============================================+=================================================+ | ||
| home_dir | LineaPy base folder | Path | `$HOME/.lineapy` | `LINEAPY_HOME_DIR` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| artifact_storage_dir | artifact saving folder | Path | `$LINEAPY_HOME_DIR/linea_pickles` | `LINEAPY_ARTIFACT_STORAGE_DIR` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| database_url | LineaPy db connection string | string | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite` | `LINEAPY_DATABASE_URL` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| customized_annotation_folder | user annotations folder | Path | `$LINEAPY_HOME_DIR/customized_annotations` | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| do_not_track | disable user analytics | boolean | false | `LINEAPY_DO_NOT_TRACK` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| logging_level | logging level | string | INFO | `LINEAPY_LOGGING_LEVEL` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| logging_file | logging file path | Path | `$LINEAPY_HOME_DIR/lineapy.log` | `LINEAPY_LOGGING_FILE` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
|
||
* Configuration items for integration with other tools | ||
|
||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| name | usage | type | default | environmental variables | | ||
+=====================================+=======================================+=========+============================================+=================================================+ | ||
| mlflow_tracking_uri | mlflow tracking | string | None | `LINEAPY_MLFLOW_TRACKING_URI` | | ||
Review comment: Maybe for documentation we should break down "integration specific" configurations into a separate section to prevent this table from becoming massive. |
||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| mlflow_registry_uri | mlflow registry | string | None | `LINEAPY_MLFLOW_REGISTRY_URI` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
| default_ml_models_storage_backend | default storage backend for ml models | string | mlflow | `LINEAPY_DEFAULT_ML_MODELS_STORAGE_BACKEND` | | ||
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+ | ||
|
||
All LineaPy configuration items follow the same naming convention: in the configuration file, all variable names should be lower case with underscores, | ||
all environmental variable names should be upper case with underscores, and all CLI options should be lower case. | ||
|
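The relationship between a configuration name and its environmental variable in the tables above can be sketched as follows. The helper name `env_var_name` is hypothetical; only the `LINEAPY_` prefix and the upper-casing are taken from the table rows.

```python
def env_var_name(config_key: str) -> str:
    """Map a lower-case config key to the environment variable shown in the table."""
    # Every row in the table uses the key upper-cased with a LINEAPY_ prefix.
    return "LINEAPY_" + config_key.upper()


for key in ("mlflow_tracking_uri", "default_ml_models_storage_backend"):
    print(key, "->", env_var_name(key))
```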
@@ -107,3 +121,27 @@ Instead, if you want ot use environmental variables, you should configure it thr | |
|
||
Note that which ``storage_options`` items you can set depends on the filesystem you are using. | ||
In the following section, we will discuss how to set the storage options for S3. | ||
|
||
Artifact Backend Storage | ||
------------------------ | ||
|
||
When an artifact is also an ML model, you can set the ``mlflow_tracking_uri`` and ``mlflow_registry_uri`` (depending on how your MLflow is configured) to use MLflow as the storage backend for ML models; | ||
that is, calling ``lineapy.save(model, 'model', storage_backend='mlflow')`` saves the artifact (ML model) directly in MLflow while still registering it in the LineaPy artifact store. | ||
|
||
For instance, if you want to use ``databricks`` as your MLflow tracking URI to save your ML models, you can set it with | ||
|
||
.. code:: python | ||
|
||
lineapy.options.set('mlflow_tracking_uri', 'databricks') | ||
|
||
or you can put it in the LineaPy configuration files, and you can run | ||
|
||
.. code:: python | ||
|
||
lineapy.save(model, 'model', storage_backend='mlflow') | ||
|
||
to save your artifact (ML model) in MLflow while you can still use it as a typical LineaPy artifact. | ||
If the ``model`` is not supported by MLflow, LineaPy falls back to using its standard protocol to save the model as an artifact. | ||
|
||
Furthermore, if ``default_ml_models_storage_backend='mlflow'`` (the default when you only set ``mlflow_tracking_uri``), there is no need to specify ``storage_backend='mlflow'`` in ``lineapy.save`` to save the model in MLflow. | ||
Alternatively, you can change to ``default_ml_models_storage_backend='lineapy'`` to save your artifacts (ML models) with the LineaPy backend by default and use MLflow only when you specify ``storage_backend='mlflow'`` in ``lineapy.save``. |
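The rules for choosing a backend described in the docs above can be condensed into one function. This is a minimal sketch of the documented behavior, not the code in this PR: the function name is hypothetical, and the cases where no tracking URI is set are assumptions, since the docs only cover the cases where ``mlflow_tracking_uri`` has been set.

```python
from typing import Optional


def resolve_ml_model_backend(
    mlflow_tracking_uri: Optional[str],
    default_ml_models_storage_backend: Optional[str] = None,
    storage_backend: Optional[str] = None,
) -> str:
    """Return which backend a lineapy.save call would use for an ML model."""
    # An explicit storage_backend argument always wins in the documented cases.
    if storage_backend is not None:
        return storage_backend
    # Assumption: without a tracking URI, LineaPy keeps using its own storage.
    if mlflow_tracking_uri is None:
        return "lineapy"
    # With a tracking URI set, the default is "mlflow" unless overridden.
    return default_ml_models_storage_backend or "mlflow"


print(resolve_ml_model_backend("databricks"))
print(resolve_ml_model_backend("databricks", "lineapy"))
print(resolve_ml_model_backend("databricks", "lineapy", storage_backend="mlflow"))
```

The three calls mirror the three documented scenarios: only the tracking URI set, the default overridden to ``lineapy``, and an explicit per-call override.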
42 changes: 42 additions & 0 deletions
lineapy/_alembic/versions/07d0db31e15f_mlflow_integration.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
"""mlflow_integration | ||
|
||
Revision ID: 07d0db31e15f | ||
Revises: 4907800d9126 | ||
Create Date: 2022-11-03 16:26:37.217174 | ||
|
||
""" | ||
import sqlalchemy as sa | ||
from alembic import op | ||
|
||
# revision identifiers, used by Alembic. | ||
revision = "07d0db31e15f" | ||
down_revision = "4907800d9126" | ||
branch_labels = None | ||
depends_on = None | ||
|
||
|
||
def upgrade() -> None: | ||
# ### commands auto generated by Alembic - please adjust! ### | ||
op.create_table( | ||
"mlflow_artifact_storage", | ||
sa.Column("id", sa.Integer(), autoincrement=True, nullable=False), | ||
sa.Column("artifact_id", sa.Integer(), nullable=False), | ||
sa.Column("backend", sa.String(), nullable=False), | ||
sa.Column("tracking_uri", sa.String(), nullable=False), | ||
sa.Column("registry_uri", sa.String(), nullable=True), | ||
sa.Column("model_uri", sa.String(), nullable=False), | ||
sa.Column("model_flavor", sa.String(), nullable=False), | ||
sa.Column("delete_time", sa.DateTime(), nullable=True), | ||
sa.ForeignKeyConstraint( | ||
["artifact_id"], | ||
["artifact.id"], | ||
), | ||
sa.PrimaryKeyConstraint("id"), | ||
) | ||
# ### end Alembic commands ### | ||
|
||
|
||
def downgrade() -> None: | ||
# ### commands auto generated by Alembic - please adjust! ### | ||
op.drop_table("mlflow_artifact_storage") | ||
# ### end Alembic commands ### |
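To see what the migration's table looks like in practice, the sketch below recreates an equivalent schema with the standard-library `sqlite3` module. The one-column `artifact` table is a stand-in for LineaPy's real artifact table, the inserted values are made up, and treating `delete_time` as a soft-delete marker is an assumption based on the column name and the "Change mlflow deletion db logic" commit.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Stand-in for LineaPy's artifact table (the real one has more columns).
conn.execute("CREATE TABLE artifact (id INTEGER PRIMARY KEY)")
# Same columns and nullability as the Alembic upgrade() above.
conn.execute(
    """
    CREATE TABLE mlflow_artifact_storage (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        artifact_id INTEGER NOT NULL REFERENCES artifact (id),
        backend TEXT NOT NULL,
        tracking_uri TEXT NOT NULL,
        registry_uri TEXT,
        model_uri TEXT NOT NULL,
        model_flavor TEXT NOT NULL,
        delete_time TIMESTAMP
    )
    """
)
conn.execute("INSERT INTO artifact (id) VALUES (1)")
conn.execute(
    "INSERT INTO mlflow_artifact_storage "
    "(artifact_id, backend, tracking_uri, model_uri, model_flavor) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "mlflow", "databricks", "models:/clf/1", "sklearn"),
)
# Assumed soft delete: stamp delete_time instead of removing the row.
conn.execute(
    "UPDATE mlflow_artifact_storage SET delete_time = ? WHERE artifact_id = ?",
    (datetime.now(timezone.utc).isoformat(), 1),
)
live_rows = conn.execute(
    "SELECT COUNT(*) FROM mlflow_artifact_storage WHERE delete_time IS NULL"
).fetchone()[0]
```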
I am not sure if the distinction between "storage location" and "storage backend" will be clear to our users. As such, why not combine these two sections? That is, the current PR's contents relating to MLflow can be put under the existing "Changing Storage Location" section and be titled "Storing model artifact values in MLflow", i.e.:
I understand why you have this comment. However, location and backend are at different layers of abstraction. I think putting them together is more confusing.
MLflow can have its own storage location; it can be local/s3/postgres. In this case, we are basically saying that for this type of artifact (ML model), we use MLflow to handle the storage; it could be s3/local/gcp/... and we don't really care. We just need to specify the host of MLflow, and MLflow will take care of the rest. However, for LineaPy itself, we are the host of LineaPy. We need to configure the underlying storage location and how the catalog (db) is hosted.
is the name storage_location and storage_backend also used by mlflow for those two configs? if so, then it is probably already clear for mlflow users as you said.
hmm i think they have a backend store and artifact store. the backend store is configured using --backend-store-uri and the artifact store is configured using --default-artifact-root.
not sure if i'm completely right here but the storage_location here will be the --backend-store-uri of mlflow and storage_backend will be "mlflow"?
@mingjerli Ok, here's my understanding based on your explanation above:
If this understanding is correct, then I suggest we make things more clear in the landing page of the "Changing Storage Backend" section, like so:
@lionsardesai I think what you are referring to is how to set up/run an MLflow server, not how end users connect to MLflow. End users are using `mlflow.set_tracking_uri` and `mlflow.set_registry_uri` for configuration to connect to MLflow. They don't need to know how the MLflow server has been set up exactly (as long as their IT/ops people tell them which tracking_uri and registry_uri they should use). And the crazy part is you can almost put anything into `set_tracking_uri`: filepath, s3path, database ...
@yoonspark agree, added