Lin 673 lineapy.get for MLflow #829

Merged
merged 14 commits into from
Nov 4, 2022
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -179,3 +179,6 @@ Untitled*.ipynb
tests/outputs
*.pickle
.linea/linea_pickles

# mlflow
mlruns/
6 changes: 6 additions & 0 deletions docs/source/autogen/lineapy.api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,12 @@ lineapy.api.api\_utils module
.. automodule:: lineapy.api.api_utils
   :members:

lineapy.api.artifact\_serializer module
---------------------------------------

.. automodule:: lineapy.api.artifact_serializer
   :members:

Module contents
---------------

8 changes: 8 additions & 0 deletions docs/source/autogen/lineapy.plugins.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,14 @@
lineapy.plugins package
=======================

Subpackages
-----------

.. toctree::
   :maxdepth: 2

   lineapy.plugins.serializers

Submodules
----------

17 changes: 17 additions & 0 deletions docs/source/autogen/lineapy.plugins.serializers.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
lineapy.plugins.serializers package
===================================

Submodules
----------

lineapy.plugins.serializers.mlflow\_io module
---------------------------------------------

.. automodule:: lineapy.plugins.serializers.mlflow_io
   :members:

Module contents
---------------

.. automodule:: lineapy.plugins.serializers
   :members:
15 changes: 15 additions & 0 deletions docs/source/guide/manage_artifacts/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -39,8 +39,23 @@ imagine how difficult it would be to maintain correlations between the two. Line

Read more about configuration :ref:`here <configurations>`.


Storage Backend
---------------

Out of the box, LineaPy is the default storage backend for all artifacts.
When another storage backend is already in use (e.g., model artifacts stored in MLflow), saving a second copy of the same artifact in LineaPy can cause sync issues between the two systems.
Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models).
This support lets users leverage functionality from both LineaPy and other familiar toolkits (e.g., MLflow).

.. note::

    Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3.
    For instance, LineaPy is a storage backend that can use different storage locations.

.. toctree::
    :maxdepth: 1

    artifact_reuse
    storage_location/index
    storage_backend/index
Comment on lines 60 to +61
Contributor

I am not sure if the distinction between "storage location" and "storage backend" will be clear to our user. As such, why not combine these two sections? That is, the current PR's contents relating to MLflow can be put under existing Changing Storage Location section and be titled "Storing model artifact values in MLflow", i.e.:

Changing Storage Location
  - Storing Artifact Metadata in PostgreSQL
  - Storing Artifact Values in Amazon S3
  - Storing Model Artifact Values in MLflow

Contributor Author

@mingjerli mingjerli Nov 3, 2022

I understand why you have this comment. However, location and backend are at different layers of abstraction. I think putting them together is more confusing.

MLflow can have its own storage location; it can be local/s3/postgres. In this case, we are basically saying that for this type of artifact (ML model), we use MLflow to handle the storage; it could be s3/local/gcp/... and we don't really care. We just need to specify the MLflow host, and MLflow will take care of the rest. However, for LineaPy itself, we are the host of LineaPy. We need to configure the underlying storage location and how the catalog (db) is hosted.

Contributor

is the name storage_location and storage_backend also used by mlflow for those two configs? if so, then it is probably already clear for mlflow users as you said.

Contributor

@lionsardesai lionsardesai Nov 3, 2022

hmm i think they have a backend store and artifact store. the backend store is configured using --backend-store-uri and the artifact store is configured using --default-artifact-root.

not sure if i'm completely right here but the storage_location here will be the --backend-store-uri of mlflow and storage_backend will be "mlflow"?

Contributor

@mingjerli Ok, here's my understanding based on your explanation above:

image

If this understanding is correct, then I suggest we make things more clear in the landing page of Changing Storage Backend section, like so:

Out of the box, LineaPy is the default storage backend for all artifacts. For certain storage backends in use (e.g., storing model artifacts in MLflow), saving one more copy of the same artifact into LineaPy causes sync issue between the two systems. Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models). This support is essential for users to leverage functionalities from both LineaPy and other familiar toolkit (e.g., MLflow).

NOTE: Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3. For instance, LineaPy is a storage backend that can use different storage locations.

  • Using MLflow as Storage Backend to Save ML Models

Contributor Author

@lionsardesai I think what you are referring to is how to setup/run an MLflow server, not how end users connect to MLflow. End users are using mlflow.set_tracking_uri and mlflow.set_registry_uri for configuration to connect to MLflow. They don't need to know how the MLflow server has been set up exactly(as long as their IT/ops people tell them which tracking_uri and registry_uri they should use). And the crazy part is you can almost put anything into set_tracking_uri, filepath, s3path, database ...

@yoonspark agree, added

14 changes: 14 additions & 0 deletions docs/source/guide/manage_artifacts/storage_backend/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
Changing Storage Backend
========================

Out of the box, LineaPy is the default storage backend for all artifacts.
When an existing storage system (e.g., MLflow or a database) is already used to save artifacts, saving one more copy in LineaPy can cause syncing issues between the two systems.
Thus, LineaPy supports using different storage backends for some data types.
This support lets users leverage functionality from both LineaPy and the tools they already know.

Currently, LineaPy supports MLflow as a storage backend for ML models.

.. toctree::
    :maxdepth: 1

    mlflow
110 changes: 110 additions & 0 deletions docs/source/guide/manage_artifacts/storage_backend/mlflow.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
.. _mlflow:

Using MLflow as Storage Backend to Save ML Models
=================================================

.. include:: ../../../snippets/slack_support.rstinc

By default, LineaPy saves artifacts of all object types in its own storage backend.
However, users who have access to MLflow may prefer to save their ML models there.
Thus, LineaPy enables using MLflow as the storage backend for ML models.

Configure MLflow
----------------

Depending on how MLflow is configured, we might need to specify the ``tracking URI`` and (optionally) the ``registry URI`` in MLflow before we can start using it.

.. code:: python

    import mlflow

    mlflow.set_tracking_uri('your_mlflow_tracking_uri')
    mlflow.set_registry_uri('your_mlflow_registry_uri')

To make LineaPy aware of MLflow, we need to set the corresponding config items before using MLflow as the storage backend for ML models.

.. code:: python

    import lineapy

    lineapy.options.set('mlflow_tracking_uri','your_mlflow_tracking_uri')
    lineapy.options.set('mlflow_registry_uri','your_mlflow_registry_uri')


.. note::

    For objects not supported by MLflow, LineaPy falls back to using itself as the storage backend, as usual.

Set Default Storage Backend for ML Models
-----------------------------------------

Users have different usage patterns for MLflow. Some use it for logging and record every model under development; others treat it as a public space and only publish models that meet specific criteria.
In the first case, users want MLflow to save artifacts (ML models) by default; in the second case, users want MLflow to save artifacts only when they explicitly ask for it.
Thus, we provide an option (``default_ml_models_storage_backend``) that lets users decide the default storage backend for ML models when ``mlflow_tracking_uri`` has been set.

The following cases show which storage backend is used for ML models:

* Only set ``mlflow_tracking_uri`` but not ``default_ml_models_storage_backend``

  .. code:: python

      lineapy.options.set("mlflow_tracking_uri", "databricks")

      # Use MLflow: when mlflow_tracking_uri is set,
      # default_ml_models_storage_backend defaults to mlflow
      lineapy.save(model, 'model')
      lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
      lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy


* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='mlflow'``

  .. code:: python

      lineapy.options.set("mlflow_tracking_uri", "databricks")
      lineapy.options.set("default_ml_models_storage_backend", "mlflow")
      lineapy.save(model, 'model') # Use MLflow
      lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
      lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy


* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='lineapy'``

  .. code:: python

      lineapy.options.set("mlflow_tracking_uri", "databricks")
      lineapy.options.set("default_ml_models_storage_backend", "lineapy")
      lineapy.save(model, 'model') # Use LineaPy
      lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
      lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy
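
The three cases above reduce to a small decision rule. The following is a minimal sketch of that rule in plain Python, not LineaPy's actual implementation; the function name ``resolve_storage_backend`` is hypothetical, and what happens when ``storage_backend='mlflow'`` is requested without a tracking URI is not covered by the text, so the sketch simply returns the request.

```python
from typing import Optional


def resolve_storage_backend(
    mlflow_tracking_uri: Optional[str],
    default_ml_models_storage_backend: Optional[str] = None,
    storage_backend: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick 'mlflow' or 'lineapy' for an ML model."""
    # An explicit storage_backend argument always wins in the cases listed.
    if storage_backend is not None:
        return storage_backend
    # Without a tracking URI, MLflow is not configured, so LineaPy is used.
    if mlflow_tracking_uri is None:
        return "lineapy"
    # With a tracking URI set, the default is 'mlflow' unless overridden.
    return default_ml_models_storage_backend or "mlflow"


print(resolve_storage_backend("databricks"))
print(resolve_storage_backend("databricks", "lineapy"))
print(resolve_storage_backend("databricks", "lineapy", storage_backend="mlflow"))
```

Only the option names and the values ``mlflow`` and ``lineapy`` come from the documentation; the ordering of the checks is one reading of the listed cases.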

Note that when MLflow is the storage backend, ``lineapy.save`` wraps ``mlflow.flavor.log_model`` under the hood, where ``flavor`` stands for the model's MLflow flavor (e.g., ``mlflow.sklearn.log_model``).
Users can therefore pass any argument accepted by ``mlflow.flavor.log_model`` to ``lineapy.save`` as well.
For instance, if we want to specify ``registered_model_name``, we can write the save statement as:

.. code:: python

    lineapy.save(model, name="model", storage_backend="mlflow", registered_model_name="clf")
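
The pass-through of extra arguments can be pictured with a small stand-alone sketch. Both ``fake_log_model`` and ``save_sketch`` are hypothetical stand-ins written for illustration; they are not the MLflow or LineaPy functions, and using the artifact name as ``artifact_path`` is an assumption.

```python
from typing import Any, Dict


def fake_log_model(model: Any, artifact_path: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for a flavor's log_model; it only records its inputs."""
    return {"artifact_path": artifact_path, **kwargs}


def save_sketch(
    model: Any, name: str, storage_backend: str = "mlflow", **kwargs: Any
) -> Dict[str, Any]:
    """Hypothetical wrapper: extra keyword arguments are forwarded untouched."""
    if storage_backend != "mlflow":
        raise ValueError("this sketch only covers the MLflow path")
    # Assumption: the artifact name doubles as the MLflow artifact path.
    return fake_log_model(model, artifact_path=name, **kwargs)


logged = save_sketch(object(), name="model", registered_model_name="clf")
print(logged)  # {'artifact_path': 'model', 'registered_model_name': 'clf'}
```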

Retrieve Artifact from Both LineaPy and MLflow
----------------------------------------------

Once an artifact (ML model) has been saved by ``lineapy.save`` with ``mlflow`` as the storage backend, users can retrieve the same artifact through either the LineaPy API or the MLflow API, depending on what they want to do (or which API they are more familiar with).

* Retrieve artifact (model) with LineaPy API

  .. code:: python

      artifact = lineapy.get('model')
      lineapy_model = artifact.get_value()

* Retrieve artifact (model) with MLflow API

  .. code:: python

      import mlflow

      client = mlflow.MlflowClient()
      # Assumes the first search result is the latest version
      latest_version = client.search_model_versions("name='clf'")[0].version
      # This loads the same model as `lineapy_model` in the previous snippet
      mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}')
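
The snippet above takes the first search result as the latest version. If the ordering of the results is not something you want to rely on, the highest version number can be selected explicitly. The sketch below uses a hypothetical ``VersionStub`` dataclass in place of MLflow's model version objects and assumes only that each record exposes a ``version`` attribute that ``int()`` can parse.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class VersionStub:
    """Hypothetical stand-in for a model version record returned by a search."""

    name: str
    version: str  # kept as a string here; int() below also accepts integers


def latest_version(versions: Iterable[VersionStub]) -> str:
    # Compare numerically: as strings, "10" would sort before "9".
    return max(versions, key=lambda v: int(v.version)).version


stubs = [VersionStub("clf", "9"), VersionStub("clf", "10"), VersionStub("clf", "2")]
print(latest_version(stubs))  # 10
```

The result can then be placed in a ``models:/clf/<version>`` URI as in the snippet above.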

Which MLflow Model Flavor is Supported
--------------------------------------

Currently, the following flavors are supported: ``sklearn``, ``xgboost``, ``prophet``, and ``statsmodels``.
We plan to support all model flavors that MLflow supports soon.
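
Combined with the fallback to LineaPy for unsupported objects, the flavor list suggests a simple dispatch. The sketch below guesses a flavor from the top-level module of an object's class; this detection strategy and the names ``guess_flavor`` and ``choose_backend`` are assumptions made for illustration, since the PR text does not show how LineaPy detects flavors.

```python
from typing import Optional

# Flavor names taken from the list above.
SUPPORTED_FLAVORS = ("sklearn", "xgboost", "prophet", "statsmodels")


def guess_flavor(obj: object) -> Optional[str]:
    """Hypothetical check: map an object to a flavor via its class's root module."""
    root = type(obj).__module__.split(".")[0]
    return root if root in SUPPORTED_FLAVORS else None


def choose_backend(obj: object) -> str:
    # Unsupported objects fall back to LineaPy, as the documentation notes.
    return "mlflow" if guess_flavor(obj) is not None else "lineapy"


# A dummy class that pretends to live in sklearn, so the sketch runs without it.
FakeModel = type("FakeModel", (), {"__module__": "sklearn.linear_model"})

print(guess_flavor(FakeModel()))
print(choose_backend([1, 2, 3]))
```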

72 changes: 55 additions & 17 deletions docs/source/references/configurations.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,23 +11,37 @@ These items are determined by the following order:
- Configuration file
- Default values

+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| name | usage | type | default | environmental variables |
+=====================================+===============================+=========+============================================+=================================================+
| home_dir | LineaPy base folder | Path | `$HOME/.lineapy` | `LINEAPY_HOME_DIR` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| artifact_storage_dir | artifact saving folder | Path | `$LINEAPY_HOME_DIR/linea_pickles` | `LINEAPY_ARTIFACT_STORAGE_DIR` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| database_url | LineaPy db connection string | string | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite` | `LINEAPY_DATABASE_URL` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| customized_annotation_folder | user annotations folder | Path | `$LINEAPY_HOME_DIR/customized_annotations` | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| do_not_track | disable user analytics | boolean | false | `LINEAPY_DO_NOT_TRACK` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_level | logging level | string | INFO | `LINEAPY_LOGGING_LEVEL` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_file | logging file path | Path | `$LINEAPY_HOME_DIR/lineapy.log` | `LINEAPY_LOGGING_FILE` |
+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
* Core LineaPy configuration items

+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| name | usage | type | default | environmental variables |
+=====================================+=======================================+=========+============================================+=================================================+
| home_dir | LineaPy base folder | Path | `$HOME/.lineapy` | `LINEAPY_HOME_DIR` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| artifact_storage_dir | artifact saving folder | Path | `$LINEAPY_HOME_DIR/linea_pickles` | `LINEAPY_ARTIFACT_STORAGE_DIR` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| database_url | LineaPy db connection string | string | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite` | `LINEAPY_DATABASE_URL` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| customized_annotation_folder | user annotations folder | Path | `$LINEAPY_HOME_DIR/customized_annotations` | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| do_not_track | disable user analytics | boolean | false | `LINEAPY_DO_NOT_TRACK` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_level | logging level | string | INFO | `LINEAPY_LOGGING_LEVEL` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| logging_file | logging file path | Path | `$LINEAPY_HOME_DIR/lineapy.log` | `LINEAPY_LOGGING_FILE` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+

* Configuration item for integration with other tools

+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| name | usage | type | default | environmental variables |
+=====================================+=======================================+=========+============================================+=================================================+
| mlflow_tracking_uri | mlflow tracking | string | None | `LINEAPY_MLFLOW_TRACKING_URI` |
Contributor

Maybe for documentation we should break down "integration specific" configurations into a separate section to prevent this table from becoming massive.

+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| mlflow_registry_uri | mlflow registry | string | None | `LINEAPY_MLFLOW_REGISTRY_URI` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
| default_ml_models_storage_backend | default storage backend for ml models | string | mlflow | `LINEAPY_DEFAULT_ML_MODELS_STORAGE_BACKEND` |
+-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+

All LineaPy configuration items follow the same naming convention: in the configuration file, variable names are lower case with underscores;
environment variable names are upper case with underscores; and CLI options are lower case.
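
The naming convention and the tables above imply a mechanical mapping from a configuration key to its environment variable. The following sketch illustrates that mapping and one possible lookup order; ``env_var_name`` and ``lookup`` are hypothetical helpers, and placing environment variables ahead of the configuration file is an assumption, since the full precedence list is not visible in this diff.

```python
import os
from typing import Any, Mapping, Optional


def env_var_name(config_key: str) -> str:
    """Map a config key such as 'logging_level' to 'LINEAPY_LOGGING_LEVEL'."""
    return "LINEAPY_" + config_key.upper()


def lookup(
    config_key: str,
    file_config: Mapping[str, Any],
    defaults: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Hypothetical resolution: environment, then config file, then default."""
    environ = os.environ if environ is None else environ
    name = env_var_name(config_key)
    if name in environ:
        return environ[name]
    if config_key in file_config:
        return file_config[config_key]
    return defaults.get(config_key)


print(env_var_name("default_ml_models_storage_backend"))
print(lookup("logging_level", {}, {"logging_level": "INFO"}, environ={}))
```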
Expand Down Expand Up @@ -107,3 +121,27 @@ Instead, if you want ot use environmental variables, you should configure it thr

Note that which ``storage_options`` items you can set depends on the filesystem you are using.
In the following section, we will discuss how to set the storage options for S3.

Artifact Backend Storage
------------------------

When an artifact is an ML model, you can set ``mlflow_tracking_uri`` and ``mlflow_registry_uri`` (depending on how your MLflow is configured) to use MLflow as the storage backend for ML models.
That is, ``lineapy.save(model, 'model', storage_backend='mlflow')`` saves the artifact (ML model) directly in MLflow while still registering it in the LineaPy artifact store.

For instance, if you want to use ``databricks`` as your MLflow tracking URI to save your ML models, you can set it with

.. code:: python

    lineapy.options.set('mlflow_tracking_uri', 'databricks')

or put it in the LineaPy configuration file. You can then run

.. code:: python

    lineapy.save(model, 'model', storage_backend='mlflow')

to save your artifact (ML model) in MLflow while still using it as a typical LineaPy artifact.
If the ``model`` is not supported by MLflow, LineaPy falls back to its standard protocol to save the model as an artifact.

Furthermore, if ``default_ml_models_storage_backend='mlflow'`` (the default when you only set ``mlflow_tracking_uri``), there is no need to specify ``storage_backend='mlflow'`` in ``lineapy.save`` to save the model in MLflow.
Alternatively, you can set ``default_ml_models_storage_backend='lineapy'`` to save your artifacts (ML models) with the LineaPy backend by default and use MLflow only when you specify ``storage_backend='mlflow'`` in ``lineapy.save``.
42 changes: 42 additions & 0 deletions lineapy/_alembic/versions/07d0db31e15f_mlflow_integration.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
"""mlflow_integration

Revision ID: 07d0db31e15f
Revises: 4907800d9126
Create Date: 2022-11-03 16:26:37.217174

"""
import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic.
revision = "07d0db31e15f"
down_revision = "4907800d9126"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.create_table(
        "mlflow_artifact_storage",
        sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
        sa.Column("artifact_id", sa.Integer(), nullable=False),
        sa.Column("backend", sa.String(), nullable=False),
        sa.Column("tracking_uri", sa.String(), nullable=False),
        sa.Column("registry_uri", sa.String(), nullable=True),
        sa.Column("model_uri", sa.String(), nullable=False),
        sa.Column("model_flavor", sa.String(), nullable=False),
        sa.Column("delete_time", sa.DateTime(), nullable=True),
        sa.ForeignKeyConstraint(
            ["artifact_id"],
            ["artifact.id"],
        ),
        sa.PrimaryKeyConstraint("id"),
    )
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_table("mlflow_artifact_storage")
    # ### end Alembic commands ###
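
To get a feel for what the migration's schema allows, the table can be approximated in an in-memory SQLite database using only the standard library. This is a rough sketch: the DDL is a hand-written approximation of the SQLAlchemy columns above, the ``artifact`` table is reduced to a single ``id`` column, and the inserted values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Minimal stand-in for the real artifact table (assumption: only id matters here).
conn.execute("CREATE TABLE artifact (id INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE mlflow_artifact_storage (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        artifact_id INTEGER NOT NULL REFERENCES artifact (id),
        backend TEXT NOT NULL,
        tracking_uri TEXT NOT NULL,
        registry_uri TEXT,
        model_uri TEXT NOT NULL,
        model_flavor TEXT NOT NULL,
        delete_time TIMESTAMP
    )
    """
)
conn.execute("INSERT INTO artifact (id) VALUES (1)")
# registry_uri and delete_time are nullable, so they can be left out.
conn.execute(
    "INSERT INTO mlflow_artifact_storage "
    "(artifact_id, backend, tracking_uri, model_uri, model_flavor) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "mlflow", "databricks", "models:/clf/1", "sklearn"),
)
row = conn.execute(
    "SELECT registry_uri, delete_time FROM mlflow_artifact_storage"
).fetchone()
print(row)  # (None, None)

# Leaving out a NOT NULL column such as model_uri is rejected.
try:
    conn.execute(
        "INSERT INTO mlflow_artifact_storage (artifact_id, backend) VALUES (1, 'mlflow')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```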