Data Engine: Autolog to MLflow on datasource queries #463

Merged
merged 6 commits into from
May 1, 2024

Conversation

kbolashev
Member

Implemented in this PR:

  • Added a link to open the query in the gallery view
  • Added autologging of queries to MLflow, triggered only on ds.all()
  • Logging runs in a parallel thread so that it does not interfere with querying
  • Querying works with multiple datasources, and the query is saved correctly for each one
  • For each datasource being queried, a new run is created in the last modified experiment, and all queries are logged into that run
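
The background-logging idea from the list above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: `log_query`, `fetch_all`, and the `logged` list are hypothetical stand-ins for the real MLflow upload and the real `ds.all()`.

```python
import threading
import time

logged = []  # hypothetical stand-in for the MLflow artifact store


def log_query(query_json: str, artifact_name: str) -> None:
    """Pretend to upload a serialized query; the sleep simulates network latency."""
    time.sleep(0.05)
    logged.append((artifact_name, query_json))


def fetch_all(query_json: str, artifact_name: str):
    """Start logging in a worker thread, then return the query results right away."""
    worker = threading.Thread(
        target=log_query,
        kwargs={"query_json": query_json, "artifact_name": artifact_name},
    )
    worker.start()
    results = ["datapoint-1", "datapoint-2"]  # placeholder for real query results
    return results, worker


results, worker = fetch_all('{"query": {}}', "autolog_example.dagshub.json")
worker.join()  # joined here only so the example can inspect `logged`
```

The caller gets its results without waiting for the upload; the cost is that a failure in the worker thread is not visible to the caller unless it is reported some other way.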

Left to implement:

  • Right now all artifacts are logged under the same name, so each one overwrites the last. I need to add a tracker for how many logs have been saved.
  • It would probably be a good idea to make the autologging toggleable in some way: the runs it creates are not closed and will pollute the environment, so I expect advanced users to want to turn it off.

Caveats/bugs:

  • Closing a run doesn't work if a logging session to another repo's MLflow is in progress. Possible solution: if the current datasource is not in the active run's repo, log in the foreground instead of the background.

@kbolashev kbolashev added the enhancement New feature or request label Apr 14, 2024
@kbolashev kbolashev self-assigned this Apr 14, 2024
@kbolashev
Member Author

Changes to be done:

  • Log ONLY to the active run; as it turns out, there is no point in creating a run on the datasource's repo
  • Come up with better naming for these cases (probably repo + datasource + index). It would be good if the index could also be obtained from MLflow.

@kbolashev
Member Author

kbolashev commented Apr 30, 2024

Changed it so that it logs to the current run.
The format is autolog_<datasource_name>_<#>.dagshub.json. I decided that adding the repo name is a bit too verbose.

Example:
image
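
One way to produce names in the `autolog_<datasource_name>_<#>.dagshub.json` format is a per-datasource counter. The sketch below is an illustration under assumptions, not the merged code: `next_artifact_name`, `_counters`, and the lock are hypothetical, and the lock is an addition for thread safety that the PR snippet does not show.

```python
import threading
from collections import defaultdict

# Hypothetical per-process counters, keyed by datasource name.
_counters = defaultdict(int)
_counters_lock = threading.Lock()


def next_artifact_name(source_name: str) -> str:
    """Return a name that is unique per datasource within this process."""
    with _counters_lock:
        index = _counters[source_name]
        _counters[source_name] += 1
    return f"autolog_{source_name}_{index}.dagshub.json"


names = [next_artifact_name("images") for _ in range(3)]
other = next_artifact_name("audio")
```

A counter like this resets whenever the process restarts, so names can still collide across sessions that log to the same run.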

@kbolashev kbolashev marked this pull request as ready for review April 30, 2024 12:51
dagshub/data_engine/model/datasource.py (outdated)
counter_value = _autolog_counters[source_name]
_autolog_counters[source_name] += 1
artifact_name = f"autolog_{source_name}_{counter_value}.dagshub.json"
threading.Thread(target=self.log_to_mlflow, kwargs={"artifact_name": artifact_name}).start()
Member

Would be wise to pass along the active run as a parameter

Member

And, are you very sure that log_to_mlflow is safe to do async?
The datasource object won't be modified, or otherwise be made into garbage by the time log_to_mlflow gets executed?

Member Author

The datasource object is deep copied on every modification, so log_to_mlflow is guaranteed to be working with the same object
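
The argument here is that copy-on-modify makes it safe to hand the object to a background thread. A minimal sketch of that property, assuming a heavily simplified and hypothetical `Datasource` class (the real class and its methods are not shown in this thread):

```python
import copy
import threading


class Datasource:
    """Hypothetical, heavily simplified stand-in for the real datasource class."""

    def __init__(self, name, filters=None):
        self.name = name
        self.filters = list(filters or [])

    def add_filter(self, condition):
        # Modify a deep copy and return it; `self` is never mutated.
        clone = copy.deepcopy(self)
        clone.filters.append(condition)
        return clone

    def serialize(self):
        return {"name": self.name, "filters": list(self.filters)}


captured = []


def log_later(ds, gate):
    gate.wait()  # run only after the caller has "modified" the datasource
    captured.append(ds.serialize())


ds = Datasource("images").add_filter("size > 100")
gate = threading.Event()
worker = threading.Thread(target=log_later, args=(ds, gate))
worker.start()
narrowed = ds.add_filter("label == 'cat'")  # produces a new object
gate.set()
worker.join()
```

The worker still sees the query as it was when the thread was started, because the later filter went into a separate copy. The thread also holds a reference to `ds`, so the object cannot be garbage collected before the worker finishes with it.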

@kbolashev kbolashev requested a review from guysmoilov May 1, 2024 07:50

artifact_name = f"autolog_{source_name}_{now_time}_{uuid_chunk}.dagshub.json"
threading.Thread(
target=self.log_to_mlflow, kwargs={"artifact_name": artifact_name, "run": mlflow.active_run()}
Member

@guysmoilov guysmoilov May 1, 2024

Might want to reuse the active_run() from line 753
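
The reviewer's suggestion to resolve the active run once and pass it to the thread can be illustrated as below. This is a sketch under assumptions: `current_run`, `active_run`, and `log_artifact` are hypothetical stand-ins, and nothing here reflects how MLflow tracks runs internally.

```python
import threading

# Hypothetical stand-in for a tracker's notion of the active run.
current_run = {"id": "run-A"}
written = []


def active_run():
    return current_run["id"]


def log_artifact(artifact_name, run, gate):
    gate.wait()  # delay until the caller has moved on to another run
    written.append((run, artifact_name))


gate = threading.Event()
run = active_run()  # resolved once, in the calling thread
worker = threading.Thread(
    target=log_artifact,
    kwargs={"artifact_name": "autolog_ds_0.dagshub.json", "run": run, "gate": gate},
)
worker.start()
current_run["id"] = "run-B"  # the active run changes before the worker writes
gate.set()
worker.join()
```

Because the run is captured before the thread starts, the artifact goes to the run that was active when the query was issued, even if a different run is active by the time the worker executes.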

@kbolashev kbolashev merged commit e117d34 into master May 1, 2024
8 checks passed
@kbolashev kbolashev deleted the feature/mlflow-autolog branch May 1, 2024 08:04