Hello there! First off, I want to thank you for this great plugin :)
The problem
I'm using modular pipelines and dataset factories. When using pipeline_ml_factory to deploy pipelines to MLflow, I've encountered issues with KedroMLFlow not recognizing catalogue entries defined through factory patterns.
Use case example:
Let's say we have a model pipeline that we want to run on different datasets. To differentiate the datasets, we use namespaces: our catalogue declares factory-pattern entries so that the data and the model are persisted separately per namespace, and we set up the PipelineML instance for pipeline a with pipeline_ml_factory (see the sketch below).
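The catalogue and pipeline snippets from the original issue are not reproduced in this thread, so the following is only a minimal sketch of the kind of setup described above; every name in it (train_model, predict, dataset_a, model_input, the "{namespace}.model" pattern) is illustrative rather than taken from the project.

# Minimal sketch of the setup described above. All names are illustrative.
# The catalogue is assumed to declare factory patterns such as "{namespace}.model"
# and "{namespace}.model_input" instead of one explicit entry per dataset.
from kedro.pipeline import node, pipeline
from kedro_mlflow.pipeline import pipeline_ml_factory


def train_model(model_input):
    # placeholder training node
    return {"fitted": True}


def predict(model, model_input):
    # placeholder inference node
    return model_input


base_training = pipeline([node(train_model, inputs="model_input", outputs="model")])
base_inference = pipeline([node(predict, inputs=["model", "model_input"], outputs="predictions")])

# Namespacing prefixes every dataset name with "dataset_a.", e.g. "dataset_a.model",
# so the catalogue only needs the "{namespace}.*" factory patterns mentioned above.
pipeline_a = pipeline_ml_factory(
    training=pipeline(base_training, namespace="dataset_a"),
    inference=pipeline(base_inference, namespace="dataset_a"),
    input_name="dataset_a.model_input",
)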
When running the pipeline, KedroMLFlow does not recognise dataset_a.model as an artefact that should be uploaded. It instead throws an error:
kedro_mlflow.mlflow.kedro_pipeline_model.KedroPipelineModelError: The provided catalog must contains 'dataset_a.model' dataset since it is the input of the pipeline.
It works if I add the full name to the catalogue, but then I lose the benefits of the nice naming patterns :)
Context
This is important since I am working on a project where we re-use the model pipelines across many datasets. We separate the datasets using namespaces. Hence, it would greatly help to use the naming patterns to reduce the work every time a new dataset is added.
Possible Implementation

This line raises the exception:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/mlflow/kedro_pipeline_model.py#L119

It turns out that the DataCatalog passed to the KedroPipelineModel constructor in the after_pipeline_run hook does not contain the catalogue entries resolved from factory patterns:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0033c5072c929a4c26cfaeaf61fcedf93d36522/kedro_mlflow/framework/hooks/mlflow_hook.py#L353C1-L366

My workaround for now is to update the DataCatalog to resolve the factory patterns before passing it to KedroPipelineModel. I use the approach from kedro catalog resolve and add a _resolve_catalog helper function to mlflow_hooks.py, which I then use like this in the after_pipeline_run hook:
...
resolved_catalog = self._resolve_catalog(self.context, catalog, pipeline)
with TemporaryDirectory() as tmp_dir:
    # This will be removed at the end of the context manager,
    # but we need to log in mlflow before moving the folder
    kedro_pipeline_model = KedroPipelineModel(
        pipeline=pipeline.inference,
        catalog=resolved_catalog,
        input_name=pipeline.input_name,
        **pipeline.kpm_kwargs,
    )
...
Could this be a potential solution to the problem? Or is there a simpler way that I have totally missed :)
Glad to see you're using the pipeline_ml_factory; it is an underestimated feature of the plugin which is not well known, I guess ;)
Good catch, I've seen a bunch of issues like this since the release of dataset factories. I think we can just call dataset.exists() for each dataset to force the catalog to materialize them, which should make your code much simpler; something like this:
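(The snippet that originally followed this comment is not preserved in the thread; the sketch below is a reconstruction of the idea, assuming the pipeline and catalog objects available in the after_pipeline_run hook, not the maintainer's actual code.)

# Rough sketch (not the original snippet): force the catalog to materialize the
# datasets that currently exist only as factory patterns, so KedroPipelineModel
# can find them under their concrete names.
for dataset_name in pipeline.inference.inputs():
    if dataset_name not in catalog.list():
        # exists() makes the DataCatalog resolve the matching factory pattern
        # and register a concrete dataset under this name.
        catalog.exists(dataset_name)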
Thank you for that suggestion, @Galileo-Galilei; that seemed to do the trick! I opened a PR: #519, but unfortunately I missed linking it to this issue and setting you as a reviewer 🙈 I can't seem to edit that now that the PR is open.