LineaLabs · mingjerli · Nov 4, 2022 · Oct 20, 2022 · Oct 26, 2022 · Oct 28, 2022
diff --git a/.gitignore b/.gitignore
@@ -179,3 +179,6 @@ Untitled*.ipynb
 tests/outputs
 *.pickle
 .linea/linea_pickles
+
+# mlflow
+mlruns/
diff --git a/docs/source/autogen/lineapy.api.rst b/docs/source/autogen/lineapy.api.rst
@@ -24,6 +24,12 @@ lineapy.api.api\_utils module
 .. automodule:: lineapy.api.api_utils
    :members:
 
+lineapy.api.artifact\_serializer module
+---------------------------------------
+
+.. automodule:: lineapy.api.artifact_serializer
+   :members:
+
 Module contents
 ---------------
 

diff --git a/docs/source/autogen/lineapy.plugins.rst b/docs/source/autogen/lineapy.plugins.rst
@@ -1,6 +1,14 @@
 lineapy.plugins package
 =======================
 
+Subpackages
+-----------
+
+.. toctree::
+   :maxdepth: 2
+
+   lineapy.plugins.serializers
+
 Submodules
 ----------
 

diff --git a/docs/source/autogen/lineapy.plugins.serializers.rst b/docs/source/autogen/lineapy.plugins.serializers.rst
@@ -0,0 +1,17 @@
+lineapy.plugins.serializers package
+===================================
+
+Submodules
+----------
+
+lineapy.plugins.serializers.mlflow\_io module
+---------------------------------------------
+
+.. automodule:: lineapy.plugins.serializers.mlflow_io
+   :members:
+
+Module contents
+---------------
+
+.. automodule:: lineapy.plugins.serializers
+   :members:
diff --git a/docs/source/guide/manage_artifacts/index.rst b/docs/source/guide/manage_artifacts/index.rst
@@ -39,8 +39,23 @@ imagine how difficult it would be to maintain correlations between the two. Line
 
    Read more about configuration :ref:`here <configurations>`.
 
+
+Storage Backend
+---------------
+
+Out of the box, LineaPy is the default storage backend for all artifacts.
+When another storage backend is already in use (e.g., model artifacts are stored in MLflow), saving a second copy of the same artifact in LineaPy can cause sync issues between the two systems.
+Thus, LineaPy supports using different storage backends for certain data types (e.g., ML models).
+This lets users leverage functionality from both LineaPy and other familiar toolkits (e.g., MLflow).
+
+.. note::
+
+   Storage backend refers to the overall system handling storage and should be distinguished from specific storage locations such as Amazon S3.
+   For instance, LineaPy is a storage backend that can use different storage locations.
+
 .. toctree::
    :maxdepth: 1
 
    artifact_reuse
    storage_location/index
+   storage_backend/index
diff --git a/docs/source/guide/manage_artifacts/storage_backend/index.rst b/docs/source/guide/manage_artifacts/storage_backend/index.rst
@@ -0,0 +1,14 @@
+Changing Storage Backend
+========================
+
+Out of the box, LineaPy is the default storage backend for all artifacts.
+When artifacts are already saved in an existing storage system (e.g., MLflow or a database), saving a second copy in LineaPy can cause sync issues between the two systems.
+Thus, LineaPy supports using different storage backends for some data types.
+This lets users leverage functionality from both LineaPy and the tools they already know.
+
+Currently, LineaPy supports MLflow as a storage backend for ML models.
+
+.. toctree::
+   :maxdepth: 1
+
+   mlflow
diff --git a/docs/source/guide/manage_artifacts/storage_backend/mlflow.rst b/docs/source/guide/manage_artifacts/storage_backend/mlflow.rst
@@ -0,0 +1,110 @@
+.. _mlflow:
+
+Using MLflow as Storage Backend to Save ML Models
+=================================================
+
+.. include:: ../../../snippets/slack_support.rstinc
+
+By default, LineaPy uses its own storage backend to save artifacts of all object types.
+However, users who have access to MLflow might prefer it for saving ML models.
+Thus, LineaPy supports using MLflow as the storage backend for ML models.
+
+Configure MLflow
+----------------
+
+Depending on how MLflow is configured, we might need to specify the ``tracking URI`` and (optionally) the ``registry URI`` before we can start using MLflow.
+
+.. code:: python
+
+    mlflow.set_tracking_uri('your_mlflow_tracking_uri')
+    mlflow.set_registry_uri('your_mlflow_registry_uri')
+
+To make LineaPy aware of MLflow, we also need to set the corresponding LineaPy config items before using MLflow as the storage backend for ML models.
+
+.. code:: python
+
+    lineapy.options.set('mlflow_tracking_uri','your_mlflow_tracking_uri')
+    lineapy.options.set('mlflow_registry_uri','your_mlflow_registry_uri')
+
+
+.. note::
+
+    For objects not supported by MLflow, LineaPy falls back to using itself as the storage backend, as usual.
+
+Set Default Storage Backend for ML Models
+-----------------------------------------
+
+Users have different usage patterns for MLflow: some use it for logging and record every model under development, while others treat it as a public space and only publish models that meet specific criteria.
+In the first case, users want MLflow to save artifacts (ML models) by default; in the second case, users want MLflow to save artifacts only when they explicitly ask for it.
+Thus, we provide an option (``default_ml_models_storage_backend``) that lets users choose the default storage backend for ML models once ``mlflow_tracking_uri`` has been set.
+
+The following cases show which storage backend is used for ML models:
+
+* Set only ``mlflow_tracking_uri`` and leave ``default_ml_models_storage_backend`` unset
+
+.. code:: python
+
+    lineapy.options.set("mlflow_tracking_uri", "databricks")
+
+    lineapy.save(model, 'model') # Use MLflow (when mlflow_tracking_uri is set, default_ml_models_storage_backend defaults to mlflow)
+    lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
+    lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy
+
+
+* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='mlflow'``
+
+.. code:: python
+
+    lineapy.options.set("mlflow_tracking_uri", "databricks")
+    lineapy.options.set("default_ml_models_storage_backend", "mlflow")
+    lineapy.save(model, 'model') # Use MLflow
+    lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
+    lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy
+
+
+* Set ``mlflow_tracking_uri`` and ``default_ml_models_storage_backend=='lineapy'``
+
+.. code:: python
+
+    lineapy.options.set("mlflow_tracking_uri", "databricks")
+    lineapy.options.set("default_ml_models_storage_backend", "lineapy")
+    lineapy.save(model, 'model') # Use LineaPy
+    lineapy.save(model, 'model', storage_backend='mlflow') # Use MLflow
+    lineapy.save(model, 'model', storage_backend='lineapy') # Use LineaPy
+
+Note that when using MLflow as the storage backend, ``lineapy.save`` wraps ``mlflow.flavor.log_model`` under the hood (where ``flavor`` stands for the model flavor, e.g., ``mlflow.sklearn.log_model``).
+Users can pass any argument accepted by ``mlflow.flavor.log_model`` to ``lineapy.save`` as well.
+For instance, if we want to specify ``registered_model_name``, we can write the save statement as:
+
+.. code:: python
+
+    lineapy.save(model, name="model", storage_backend="mlflow", registered_model_name="clf")
+
+Retrieve Artifact from Both LineaPy and MLflow
+----------------------------------------------
+
+Once an artifact (ML model) has been saved with ``lineapy.save`` using ``mlflow`` as the storage backend,
+users can retrieve the same artifact through either the LineaPy API or the MLflow API, depending on what they want to do (or are more familiar with).
+
+* Retrieve artifact (model) with LineaPy API
+
+.. code:: python
+
+    artifact = lineapy.get('model')
+    lineapy_model = artifact.get_value()
+
+* Retrieve artifact (model) with MLflow API
+
+.. code:: python
+
+    client = mlflow.MlflowClient()
+    latest_version = client.search_model_versions("name='clf'")[0].version
+    # This loads the same model as `lineapy_model` retrieved above
+    mlflow_model = mlflow.sklearn.load_model(f'models:/clf/{latest_version}')
+
+Supported MLflow Model Flavors
+------------------------------
+
+Currently, we support the following flavors: ``sklearn``, ``xgboost``, ``prophet``, and ``statsmodels``.
+We plan to support all model flavors supported by MLflow soon.
+
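The backend-selection rules documented in `mlflow.rst` above can be summarized as a small decision function. The following is a minimal sketch under stated assumptions, not LineaPy's actual implementation: `resolve_storage_backend`, its `options` dict, and the `flavor` argument are hypothetical names introduced here for illustration. Only the option names, the `storage_backend` values, and the list of supported flavors come from the docs in this PR.

```python
from typing import Optional

# Flavors the docs in this PR list as currently supported.
SUPPORTED_FLAVORS = {"sklearn", "xgboost", "prophet", "statsmodels"}


def resolve_storage_backend(
    options: dict,
    storage_backend: Optional[str] = None,
    flavor: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the documented rules; not LineaPy code."""
    if storage_backend is not None:
        # An explicit storage_backend argument overrides the configured default.
        # (The docs do not say what happens when no tracking URI is set;
        # this sketch does not model that case.)
        chosen = storage_backend
    elif options.get("mlflow_tracking_uri") is not None:
        # With a tracking URI set, the default backend for ML models is mlflow
        # unless default_ml_models_storage_backend says otherwise.
        chosen = options.get("default_ml_models_storage_backend") or "mlflow"
    else:
        chosen = "lineapy"
    # Objects that MLflow does not support fall back to LineaPy.
    if chosen == "mlflow" and flavor not in SUPPORTED_FLAVORS:
        return "lineapy"
    return chosen
```

Called with `{"mlflow_tracking_uri": "databricks"}` and a supported flavor, the sketch picks `mlflow`. Adding `"default_ml_models_storage_backend": "lineapy"` flips the default, and an explicit `storage_backend="mlflow"` still takes precedence over it.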
diff --git a/docs/source/references/configurations.rst b/docs/source/references/configurations.rst
@@ -11,23 +11,37 @@ These items are determined by the following order:
 - Configuration file
 - Default values
 
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| name                                | usage                         | type    | default                                    | environmental variables                         |
-+=====================================+===============================+=========+============================================+=================================================+
-| home_dir                            | LineaPy base folder           | Path    | `$HOME/.lineapy`                           | `LINEAPY_HOME_DIR`                              |
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| artifact_storage_dir                | artifact saving folder        | Path    | `$LINEAPY_HOME_DIR/linea_pickles`          | `LINEAPY_ARTIFACT_STORAGE_DIR`                  |
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| database_url                        | LineaPy db connection string  | string  | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite`    | `LINEAPY_DATABASE_URL`                          |
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| customized_annotation_folder        | user annotations folder       | Path    | `$LINEAPY_HOME_DIR/customized_annotations` | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER`          |
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| do_not_track                        | disable user analytics        | boolean | false                                      | `LINEAPY_DO_NOT_TRACK`                          |
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| logging_level                       | logging level                 | string  | INFO                                       | `LINEAPY_LOGGING_LEVEL`                         |
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
-| logging_file                        | logging file path             | Path    | `$LINEAPY_HOME_DIR/lineapy.log`            | `LINEAPY_LOGGING_FILE`                          | 
-+-------------------------------------+-------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+* Core LineaPy configuration items
+
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| name                                | usage                                 | type    | default                                    | environmental variables                         |
++=====================================+=======================================+=========+============================================+=================================================+
+| home_dir                            | LineaPy base folder                   | Path    | `$HOME/.lineapy`                           | `LINEAPY_HOME_DIR`                              |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| artifact_storage_dir                | artifact saving folder                | Path    | `$LINEAPY_HOME_DIR/linea_pickles`          | `LINEAPY_ARTIFACT_STORAGE_DIR`                  |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| database_url                        | LineaPy db connection string          | string  | `sqlite:///$LINEAPY_HOME_DIR/db.sqlite`    | `LINEAPY_DATABASE_URL`                          |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| customized_annotation_folder        | user annotations folder               | Path    | `$LINEAPY_HOME_DIR/customized_annotations` | `LINEAPY_CUSTOMIZED_ANNOTATION_FOLDER`          |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| do_not_track                        | disable user analytics                | boolean | false                                      | `LINEAPY_DO_NOT_TRACK`                          |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| logging_level                       | logging level                         | string  | INFO                                       | `LINEAPY_LOGGING_LEVEL`                         |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| logging_file                        | logging file path                     | Path    | `$LINEAPY_HOME_DIR/lineapy.log`            | `LINEAPY_LOGGING_FILE`                          |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+
+* Configuration items for integration with other tools
+
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| name                                | usage                                 | type    | default                                    | environmental variables                         |
++=====================================+=======================================+=========+============================================+=================================================+
+| mlflow_tracking_uri                 | mlflow tracking                       | string  | None                                       | `LINEAPY_MLFLOW_TRACKING_URI`                   |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| mlflow_registry_uri                 | mlflow registry                       | string  | None                                       | `LINEAPY_MLFLOW_REGISTRY_URI`                   |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
+| default_ml_models_storage_backend   | default storage backend for ml models | string  | mlflow                                     | `LINEAPY_DEFAULT_ML_MODELS_STORAGE_BACKEND`     |
++-------------------------------------+---------------------------------------+---------+--------------------------------------------+-------------------------------------------------+
 
 All LineaPy configuration items follow following naming convention; in configuration file, all variable name should be lower case with underscore, 
 all environmental variable name should be upper case with underscore and all CLI options should be lower case.
@@ -107,3 +121,27 @@ Instead, if you want ot use environmental variables, you should configure it thr
 
 Note that, which ``storage_options`` items you can set are depends on the filesystem you are using.
 In the following section, we will discuss how to set the storage options for S3.
+
+Artifact Storage Backend
+------------------------
+
+When an artifact is an ML model, you can set ``mlflow_tracking_uri`` and ``mlflow_registry_uri`` (depending on how your MLflow is configured) to use MLflow as the storage backend for ML models;
+i.e., calling ``lineapy.save(model, 'model', storage_backend='mlflow')`` saves the artifact (ML model) directly in MLflow while still registering it in the LineaPy artifact store.
+
+For instance, if you want to use ``databricks`` as your MLflow tracking URI to save your ML models, you can set it with
+
+.. code:: python
+
+    lineapy.options.set('mlflow_tracking_uri', 'databricks')
+
+or put it in the LineaPy configuration file. Then you can run
+
+.. code:: python
+
+    lineapy.save(model, 'model', storage_backend='mlflow')
+
+to save your artifact (ML model) in MLflow while still being able to use it as a typical LineaPy artifact.
+If ``model`` is not supported by MLflow, LineaPy falls back to its standard protocol to save the model as an artifact.
+
+Furthermore, if ``default_ml_models_storage_backend='mlflow'`` (the default when you only set ``mlflow_tracking_uri``), there is no need to specify ``storage_backend='mlflow'`` in ``lineapy.save`` to save the model in MLflow.
+Alternatively, you can set ``default_ml_models_storage_backend='lineapy'`` to save your artifacts (ML models) with the LineaPy backend by default and use MLflow only when you specify ``storage_backend='mlflow'`` in ``lineapy.save``.
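The three new options in the configuration table follow the same pattern as the existing ones: the environment variable is the option name in upper case with a `LINEAPY_` prefix. A minimal sketch of that mapping is shown here. The helper names `env_var_for` and `read_option_from_env` are hypothetical and are not part of LineaPy.

```python
import os
from typing import Mapping, Optional


def env_var_for(option_name: str) -> str:
    """Map a config item name to its environment variable name."""
    return "LINEAPY_" + option_name.upper()


def read_option_from_env(
    option_name: str, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Look up an option in the environment; None means it is not set there."""
    env = os.environ if environ is None else environ
    return env.get(env_var_for(option_name))
```

For example, `env_var_for("mlflow_tracking_uri")` gives `LINEAPY_MLFLOW_TRACKING_URI`, matching the table row added in this PR.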
diff --git a/lineapy/_alembic/versions/07d0db31e15f_mlflow_integration.py b/lineapy/_alembic/versions/07d0db31e15f_mlflow_integration.py
@@ -0,0 +1,42 @@
+"""mlflow_integration
+
+Revision ID: 07d0db31e15f
+Revises: 4907800d9126
+Create Date: 2022-11-03 16:26:37.217174
+
+"""
+import sqlalchemy as sa
+from alembic import op
+
+# revision identifiers, used by Alembic.
+revision = "07d0db31e15f"
+down_revision = "4907800d9126"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.create_table(
+        "mlflow_artifact_storage",
+        sa.Column("id", sa.Integer(), autoincrement=True, nullable=False),
+        sa.Column("artifact_id", sa.Integer(), nullable=False),
+        sa.Column("backend", sa.String(), nullable=False),
+        sa.Column("tracking_uri", sa.String(), nullable=False),
+        sa.Column("registry_uri", sa.String(), nullable=True),
+        sa.Column("model_uri", sa.String(), nullable=False),
+        sa.Column("model_flavor", sa.String(), nullable=False),
+        sa.Column("delete_time", sa.DateTime(), nullable=True),
+        sa.ForeignKeyConstraint(
+            ["artifact_id"],
+            ["artifact.id"],
+        ),
+        sa.PrimaryKeyConstraint("id"),
+    )
+    # ### end Alembic commands ###
+
+
+def downgrade() -> None:
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.drop_table("mlflow_artifact_storage")
+    # ### end Alembic commands ###
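To see which columns of the new `mlflow_artifact_storage` table are required, the schema from the migration above can be reproduced in an in-memory SQLite database. This is only an illustrative sketch using the standard-library `sqlite3` module rather than Alembic or SQLAlchemy. The one-column `artifact` table is a stand-in for the real one, the SQL type names are an assumed mapping of `sa.String()` and `sa.DateTime()`, and the inserted values (including the `models:/clf/1` URI) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Stand-in for the existing `artifact` table; only the referenced key is modeled.
conn.execute("CREATE TABLE artifact (id INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE mlflow_artifact_storage (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        artifact_id INTEGER NOT NULL,
        backend VARCHAR NOT NULL,
        tracking_uri VARCHAR NOT NULL,
        registry_uri VARCHAR,          -- nullable: the registry URI is optional
        model_uri VARCHAR NOT NULL,
        model_flavor VARCHAR NOT NULL,
        delete_time DATETIME,          -- nullable
        FOREIGN KEY (artifact_id) REFERENCES artifact (id)
    )
    """
)
conn.execute("INSERT INTO artifact (id) VALUES (1)")

# A row without registry_uri and delete_time is accepted.
conn.execute(
    "INSERT INTO mlflow_artifact_storage "
    "(artifact_id, backend, tracking_uri, model_uri, model_flavor) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "mlflow", "databricks", "models:/clf/1", "sklearn"),
)
row = conn.execute(
    "SELECT registry_uri, delete_time FROM mlflow_artifact_storage"
).fetchone()


def insert_without_model_uri() -> bool:
    """Return True if a row lacking model_uri is accepted, False if rejected."""
    try:
        conn.execute(
            "INSERT INTO mlflow_artifact_storage "
            "(artifact_id, backend, tracking_uri, model_flavor) "
            "VALUES (?, ?, ?, ?)",
            (1, "mlflow", "databricks", "sklearn"),
        )
    except sqlite3.IntegrityError:
        return False
    return True
```

In this sketch, omitting `model_uri` violates the `NOT NULL` constraint, while `registry_uri` and `delete_time` may be left empty, in line with `nullable=True` on those two columns in the migration.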