Dataset creation with local storage: path substitution not working #1233

Open
nfzd opened this issue Mar 21, 2024 · 2 comments
Labels
bug Something isn't working

nfzd commented Mar 21, 2024

Describe the bug

  • Created a dataset with a local folder as output_uri.
  • Changed the location of the folder.
  • Added a path substitution rule, but loading the dataset does not work.

To reproduce

Contents of clearml.conf:

api {
    (...)
}

All commands below are run from the folder /home/user/clearml_path_substitution.

Contents of file create.py:

from clearml import Dataset

dataset = Dataset.create(
    dataset_project="Test",
    dataset_name="Test-PathSubs",
    output_uri="/home/user/clearml_path_substitution/storage_1")

dataset.add_files(path="./data.xml")
dataset.upload()
dataset.finalize()

Contents of file load.py:

from clearml import Dataset

dataset = Dataset.get(
    dataset_project="Test",
    dataset_name="Test-PathSubs")

Create dataset:

$ mkdir storage_1
$ python3 create.py
ClearML results page: https://(...)/output/log
ClearML dataset page: https://(...)
Uploading dataset changes (1 files compressed to 125 B) to file:///home/user/clearml_path_substitution/storage_1
File compression and upload completed: total size 125 B, 1 chunk(s) stored (average size 125 B)

(Loading it at this point by running load.py works as expected.)

Move the storage location:

$ mv storage_1 storage_2

Add the path substitution rule:

$ (...)
$ cat ~/clearml.conf
api {
    (...)
}
sdk {
    storage {
        path_substitution = [
            # Replace registered links with local prefixes,
            # Solve mapping issues, and allow for external resource caching.
            {
                registered_prefix = "file:///home/user/clearml_path_substitution/storage_1"
                local_prefix = "file:///home/user/clearml_path_substitution/storage_2"
            }
        ]
    }
}

Try loading from the new location:

$ python3 load.py
Traceback (most recent call last):
  File "/home/user/clearml_path_substitution/load.py", line 3, in <module>
    dataset = Dataset.get(
  File "/home/user/miniconda3/envs/clearml/lib/python3.10/site-packages/clearml/datasets/dataset.py", line 1778, in get
    instance = get_instance(dataset_id)
  File "/home/user/miniconda3/envs/clearml/lib/python3.10/site-packages/clearml/datasets/dataset.py", line 1690, in get_instance
    raise ValueError("Could not load Dataset id={} state".format(task.id))
ValueError: Could not load Dataset id=(...) state

Expected behaviour

Loading should be possible from the new storage location using path substitution.
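
To make the expectation concrete, here is a minimal sketch of the prefix rewrite that the rule above is expected to perform. The helper `substitute_prefix` and the file name `example.zip` are hypothetical and only illustrate the intent; this is not how the ClearML SDK implements path substitution internally.

```python
# Hypothetical helper, not part of the ClearML SDK: it only illustrates
# the prefix rewrite that a path_substitution rule is expected to perform.
def substitute_prefix(url, rules):
    """Return url with the first matching registered_prefix replaced by its local_prefix."""
    for rule in rules:
        registered = rule["registered_prefix"]
        if url.startswith(registered):
            return rule["local_prefix"] + url[len(registered):]
    return url  # no rule matched, leave the link untouched


# Same prefixes as in the clearml.conf shown above.
rules = [
    {
        "registered_prefix": "file:///home/user/clearml_path_substitution/storage_1",
        "local_prefix": "file:///home/user/clearml_path_substitution/storage_2",
    }
]

# "example.zip" is a made-up file name used only for illustration.
registered_url = "file:///home/user/clearml_path_substitution/storage_1/example.zip"
print(substitute_prefix(registered_url, rules))
```

With the rule from the configuration, a link registered under storage_1 should resolve to the same relative location under storage_2, while links that match no rule stay unchanged.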

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.14.0
  • ClearML Server Version: WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28
  • Python Version: 3.10
  • OS: Linux

nfzd added the bug label ("Something isn't working") on Mar 21, 2024

nfzd commented Apr 9, 2024

Does anyone have an idea what the problem could be or how to debug the issue?

eugen-ajechiloae-clearml (Collaborator) commented

Hi @nfzd! It looks like the StorageHelper tries to access file:// links directly, without applying path substitution, and raises an error if the referenced file does not exist.
We will need to fix this on our side. If you wish to contribute, you could open a PR that handles path substitutions in the following method:

def get_direct_access(self, remote_path, **_):

The only workaround I can think of is forcing get_direct_access to return None:

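from clearml.storage.helper import _FileStorageDriver

# Workaround: make get_direct_access always return None. Accepting any
# positional or keyword arguments keeps the patched method from raising.
_FileStorageDriver.get_direct_access = lambda *args, **kwargs: None

# should work
from clearml import Dataset
d = Dataset.get(dataset_id="d2412eff1f7f462fb6c81065e043cd8b")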
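
For anyone picking up the suggested PR, a substitution-aware lookup might look roughly like the sketch below. It is a standalone illustration under stated assumptions: the function `direct_access_with_substitution`, the `(registered_prefix, local_prefix)` tuple format for rules, and the file name `chunk.zip` are all hypothetical, and the real `get_direct_access` in ClearML may be structured differently.

```python
import os
import tempfile
from urllib.parse import unquote, urlparse


def direct_access_with_substitution(remote_path, rules):
    """Hypothetical stand-in for a substitution-aware get_direct_access.

    Rewrites remote_path using the first matching (registered, local)
    prefix pair, then returns the local filesystem path if it exists,
    or None otherwise.
    """
    for registered, local in rules:
        if remote_path.startswith(registered):
            remote_path = local + remote_path[len(registered):]
            break
    if not remote_path.startswith("file://"):
        return None  # only file:// links can be accessed directly
    local_path = unquote(urlparse(remote_path).path)
    return local_path if os.path.exists(local_path) else None


# Demo: the data was registered under storage_1 but now lives in storage_2.
with tempfile.TemporaryDirectory() as root:
    new_dir = os.path.join(root, "storage_2")
    os.makedirs(new_dir)
    open(os.path.join(new_dir, "chunk.zip"), "w").close()

    rules = [("file://" + os.path.join(root, "storage_1"), "file://" + new_dir)]
    registered = "file://" + os.path.join(root, "storage_1", "chunk.zip")
    resolved = direct_access_with_substitution(registered, rules)
    print(resolved)  # path of chunk.zip inside storage_2
```

The point of the sketch is the ordering: the prefix is rewritten before the existence check, so a moved folder is found at its new location instead of failing on the stale registered path.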
