Data Engine: Autolog to MLflow on datasource queries #463

kbolashev · 2024-04-14T08:13:32Z

Implemented in this PR:

Added a link to open the query in the gallery view
Added autologging of queries to MLflow, triggered only on ds.all()
The logging happens in a parallel thread so that it doesn't interfere with the querying
Querying works with multiple datasources, correctly saving the query for each ds.
A new run is created in the last modified experiment for each datasource being queried; all the queries are then logged into that run.
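
For illustration, a minimal sketch of the "log in a parallel thread" idea from the list above. This is not the PR's code: log_query, fetch_all and the in-memory logged list are hypothetical stand-ins, and the real logging call would talk to MLflow rather than append to a list.

```python
import threading

logged = []  # stand-in for the MLflow artifact store (assumption of this sketch)


def log_query(datasource_name, query):
    # Hypothetical stand-in for the real logging call; it only records the query.
    logged.append((datasource_name, query))


def fetch_all(datasource_name, query):
    # Start logging in the background so the query itself is not delayed.
    worker = threading.Thread(target=log_query, args=(datasource_name, query))
    worker.start()
    results = [f"{datasource_name}-row-{i}" for i in range(3)]  # pretend query result
    return results, worker


results, worker = fetch_all("my_ds", {"filter": "size > 10"})
worker.join()  # only needed here so the demo can inspect `logged`
```

The caller gets its results without waiting for the logger; the join() at the end exists purely so the example is deterministic.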

Left to implement:

Right now the artifacts all get logged with the same name, so they are overwritten every time. I need to add a tracker for how many logs have been saved.
It would probably be a good idea to make the autologging toggleable in some way, because the runs it creates are not closed and will pollute the environment, so advanced users should be able to turn it off.
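
One possible shape for such a toggle, sketched under assumptions: the environment variable EXAMPLE_AUTOLOG_ENABLED and both helper functions are made up for this example and are not part of the client's API.

```python
import os


def autolog_enabled(env=None):
    # EXAMPLE_AUTOLOG_ENABLED is a made-up variable name for this sketch.
    env = os.environ if env is None else env
    value = env.get("EXAMPLE_AUTOLOG_ENABLED", "1")
    return value.strip().lower() not in {"0", "false", "no", "off"}


def maybe_autolog(log_fn, env=None):
    # Call log_fn only when autologging is switched on; report whether it ran.
    if not autolog_enabled(env):
        return False
    log_fn()
    return True
```

Defaulting to "on" keeps the behaviour described in this PR, while letting a user opt out without touching code.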

Caveats/bugs:

Closing a run doesn't work if there's a logging session going on to another repo's MLflow. Possible solution: if the current datasource is not in the active run's repo, log in the foreground instead of the background.
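
A rough sketch of that possible solution, assuming the two repos can be compared as plain strings. The function and parameter names here are hypothetical.

```python
import threading


def log_query(log_fn, datasource_repo, active_run_repo):
    # Same repo: logging in the background is fine.
    if datasource_repo == active_run_repo:
        worker = threading.Thread(target=log_fn)
        worker.start()
        return worker
    # Different repo: log synchronously, so the logging session has
    # finished by the time anything tries to close the run.
    log_fn()
    return None
```

Returning the thread (or None for the foreground case) lets the caller join it if it ever needs to wait for logging to finish.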

kbolashev · 2024-04-14T15:58:49Z

Changes to be done:

Log ONLY to the active run; as it turns out, there's no point in creating a run on the datasource's repo.
Come up with better naming for these cases (probably the name of the repo + datasource + index). It would be good if the index could also be retrieved from MLflow.

kbolashev · 2024-04-30T12:49:40Z

Changed it so that it logs to the current active run.
The format is autolog_<datasource_name>_<#>.dagshub.json. Decided that adding the repo is a bit too verbose.
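
A minimal sketch of how names in this format could be produced with a per-datasource counter. The module-level dictionary, the lock, and the assumption that counting starts at 0 are illustrative choices, not a copy of the PR's code.

```python
import threading
from collections import defaultdict

_counters = defaultdict(int)       # per-datasource counters (hypothetical module state)
_counters_lock = threading.Lock()  # guards the counters against concurrent queries


def next_artifact_name(source_name):
    # Read and bump the counter atomically so two queries never share an index.
    with _counters_lock:
        index = _counters[source_name]
        _counters[source_name] += 1
    return f"autolog_{source_name}_{index}.dagshub.json"
```

Note that a counter kept in process memory like this restarts with every new session, so names from separate sessions logged to the same run could still collide.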

Example:


guysmoilov · 2024-04-30T19:04:44Z

dagshub/data_engine/model/datasource.py

+            counter_value = _autolog_counters[source_name]
+            _autolog_counters[source_name] += 1
+        artifact_name = f"autolog_{source_name}_{counter_value}.dagshub.json"
+        threading.Thread(target=self.log_to_mlflow, kwargs={"artifact_name": artifact_name}).start()


Would be wise to pass along the active run as a parameter

And, are you very sure that log_to_mlflow is safe to do async?
The datasource object won't be modified, or otherwise be made into garbage by the time log_to_mlflow gets executed?

kbolashev · 2024-05-01

The datasource object is deep copied on every modification, so log_to_mlflow is guaranteed to be working with the same object
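
To make the copy-on-modify argument concrete, here is a toy sketch. The Query class is a hypothetical stand-in for the datasource, and the Event only simulates the logger running late.

```python
import copy
import threading


class Query:
    # Toy stand-in for the datasource: every modification returns a deep copy.
    def __init__(self, filters=None):
        self.filters = list(filters or [])

    def where(self, condition):
        clone = copy.deepcopy(self)
        clone.filters.append(condition)
        return clone


snapshots = []
release = threading.Event()


def log_later(query):
    release.wait()  # simulate the logger running after the caller has moved on
    snapshots.append(list(query.filters))


q1 = Query().where("size > 10")
worker = threading.Thread(target=log_later, args=(q1,))
worker.start()
q2 = q1.where("label == 'cat'")  # modifies a copy; q1 is untouched
release.set()
worker.join()
```

The background thread still sees only the first filter, because the later modification went to a new object. The thread also holds its own reference to q1 until it finishes, so the object cannot be garbage collected underneath it.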

guysmoilov · 2024-05-01T08:01:38Z

dagshub/data_engine/model/datasource.py

+
+        artifact_name = f"autolog_{source_name}_{now_time}_{uuid_chunk}.dagshub.json"
+        threading.Thread(
+            target=self.log_to_mlflow, kwargs={"artifact_name": artifact_name, "run": mlflow.active_run()}


Might want to reuse the active_run() from line 753
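
A sketch of both ideas in this thread: a name built from a timestamp plus a UUID chunk, and an active run that is looked up once and handed to the worker. The timestamp layout, the 8-character chunk, and the get_active_run / log_fn parameters are assumptions made for this example; in the real code the lookup would be mlflow.active_run(), which returns None when no run is active.

```python
import threading
import uuid
from datetime import datetime, timezone


def make_artifact_name(source_name):
    # The timestamp layout and the 8-character chunk are assumptions of this sketch.
    now_time = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    uuid_chunk = uuid.uuid4().hex[:8]
    return f"autolog_{source_name}_{now_time}_{uuid_chunk}.dagshub.json"


def autolog(source_name, get_active_run, log_fn):
    run = get_active_run()  # look the run up once, on the calling thread
    if run is None:
        return None  # no active run, so there is nothing to log to
    worker = threading.Thread(
        target=log_fn,
        kwargs={"artifact_name": make_artifact_name(source_name), "run": run},
    )
    worker.start()
    return worker
```

Passing the already-fetched run into the thread means the worker logs to the run that was active when the query was made, even if the active run changes before the thread gets scheduled.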

kbolashev added the enhancement label Apr 14, 2024

kbolashev self-assigned this Apr 14, 2024

kbolashev added 2 commits April 30, 2024 10:51

WIP

42cb61d

Fix doc typo

aa07a91

kbolashev force-pushed the feature/mlflow-autolog branch from ba02535 to aa07a91 April 30, 2024 12:41

kbolashev added 2 commits April 30, 2024 15:41

Merge branch 'refs/heads/master' into feature/mlflow-autolog

afa1270

Add the visualize url to the saved mlflow artifact

e4b7c25

kbolashev marked this pull request as ready for review April 30, 2024 12:51

kbolashev requested a review from guysmoilov April 30, 2024 12:52

guysmoilov requested changes Apr 30, 2024


Change name format of the autologged mlflow artifact

708bddc

kbolashev requested a review from guysmoilov May 1, 2024 07:50

guysmoilov approved these changes May 1, 2024


Reuse active run

2e1c3b9

kbolashev merged commit e117d34 into master May 1, 2024
8 checks passed

kbolashev deleted the feature/mlflow-autolog branch May 1, 2024 08:04
