Loading a model from DagsHub repo using get_model_path #447

Merged Mar 27, 2024 (15 commits)

Conversation

@kbolashev kbolashev commented Mar 11, 2024

This pull request adds a function to streamline loading models from a DagsHub Repository: dagshub.models.get_model_path.

This function tries to find a model in the repository and loads it to a local directory, allowing the user to use it with e.g. transformers:

AutoModel.from_pretrained(get_model_path('user/repo'))

Order of lookup for the model dir:

  • .dagshub/model.yaml file with a model_dir key
  • model and models dirs in repo root
  • model and models dirs in DagsHub Storage root
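The lookup order above can be illustrated with a minimal local sketch. It is not the code from this PR: it works on plain local directories instead of the DagsHub API, the names `find_model_dir` and `read_model_dir_key` are hypothetical, the YAML handling is a tiny stand-in for a real parser, and treating `model_dir` as relative to the repo root is an assumption.

```python
from pathlib import Path
from typing import Optional

# Conventional directory names, checked in this order.
CANDIDATE_DIRS = ("model", "models")


def read_model_dir_key(config_file: Path) -> Optional[str]:
    """Stand-in for a YAML parser: finds a top-level `model_dir: <value>` line."""
    for line in config_file.read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model_dir":
            return value.strip().strip("'\"") or None
    return None


def find_model_dir(repo_root: Path, storage_root: Path) -> Optional[Path]:
    """Hypothetical helper that mirrors the three-step lookup order."""
    # 1. An explicit model_dir key in .dagshub/model.yaml wins.
    config_file = repo_root / ".dagshub" / "model.yaml"
    if config_file.is_file():
        configured = read_model_dir_key(config_file)
        if configured is not None:
            # Assumption: model_dir is relative to the repo root.
            return repo_root / configured
    # 2. model/models in the repo root, 3. then in the storage root.
    for root in (repo_root, storage_root):
        for name in CANDIDATE_DIRS:
            candidate = root / name
            if candidate.is_dir():
                return candidate
    return None
```

Returning `None` when nothing matches is a choice made for the sketch; the actual function may raise an error instead.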

It also has path and bucket arguments that allow the user to load from a specific path or bucket.

Loading has two download modes, lazy and eager: lazy runs install_hooks() in the directory with the model, while eager downloads the files into that directory. The resulting paths are the same in both modes.

The default location for saving is ~/dagshub/models/<user>/<repo>/, and every model is always stored with its full path in the repository.
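As a rough illustration of the resulting layout, the sketch below builds the local path for a model. The function names and the `download_dest` override parameter are assumptions made for the example, and the path sanitization done in the real code is left out.

```python
from pathlib import Path
from typing import Optional


def default_download_destination(full_name: str, download_dest: Optional[str] = None) -> Path:
    """Hypothetical helper: base directory for the models of `<user>/<repo>`."""
    if download_dest is not None:
        # An explicit destination overrides the default location.
        return Path(download_dest)
    user, repo = full_name.split("/", 1)
    return Path.home() / "dagshub" / "models" / user / repo


def local_model_path(full_name: str, path_in_repo: str, download_dest: Optional[str] = None) -> Path:
    """The model keeps its full in-repository path under the base directory."""
    return default_download_destination(full_name, download_dest) / path_in_repo
```

With these assumptions, `local_model_path("user/repo", "models/bert")` resolves to `~/dagshub/models/user/repo/models/bert`.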

Unsolved problems:

  • Model invalidation. Right now, I'm not deleting the downloaded files, so once a model is loaded to a destination it stays there and is never replaced until the user deletes it manually.

@kbolashev kbolashev self-assigned this Mar 11, 2024
dagshub bot commented Mar 11, 2024

@simonlsk simonlsk self-requested a review March 11, 2024 16:28
@kbolashev
Member Author

WDYT: changing the name from get_model_path to get_model

Fixes:
- pass ref into lazy load
- lazy load now loads the transformer patches
Contributor

@simonlsk simonlsk left a comment

LGTM,
Very limited remarks and potential bugs.

dagshub/models/model_loaders.py (resolved thread)
dagshub/models/model_loaders.py (resolved thread)

@property
def model_path(self) -> Path:
    return Path(".dagshub") / "storage" / "s3" / self.repo_api.repo_name / self.path
Contributor

Will this be easy to track down if and when you change the path to which the bucket is located?
Isn't there a utility function to calculate the bucket path in the file system?

Member Author

Not really easy to track down, and this is a bit of a band-aid before we redo the paths for all of the storage in install_hooks().
Right now model_path is built this way to stay consistent with install_hooks.

Contributor

OK, I meant: will it be easy for you to track down the band-aids when you redo the paths?

Member Author

I remember them, and the refactor is supposed to be done soon, so I hope I won't forget.
I'll try to DRY it then, maybe in RepoAPI

dagshub/models/model_loaders.py (outdated, resolved thread)
dagshub/models/model_locator.py (outdated, resolved thread)
def download_destination(self) -> Path:
    if self._download_dest is not None:
        return Path(self._download_dest)
    return Path(sanitize_filepath(os.path.join(Path.home(), "dagshub", "models", self.repo_api.full_name)))
Contributor

Should this be customizable with an env variable?
Is there already a similar path for dagshub resources? Should this use a common logic with the token cache location for example?

Member Author

@kbolashev kbolashev Mar 24, 2024

I don't want to reuse the token cache location for this, because it uses a kind of "AppData" directory that isn't supposed to be used for file storage.
These models can be very big, so the directory could grow a lot, and I want it to be easily discoverable by users.
We're already using ~/dagshub for data engine files, so in a way this reuses an existing path.

Contributor

But kind of same comment as before, the different local paths aren't really managed anywhere, and they're spread inline, so I was implying that I expected to see some DRY for how the path is built. This would allow later changes to be less painful. Not critical for now anyway.

Comment on lines 114 to 119
if str_path.startswith("dagshub_storage/"):
    return str_path[len("dagshub_storage/"):], StorageType.DagshubStorage
for storage in self.repo_storages:
    if str_path.startswith(f"{storage.name}/"):
        bucketPath = f"{storage.protocol}/{str_path}"
        return bucketPath, StorageType.Bucket
Contributor

The user here needs some advanced knowledge of the necessary naming convention.

Member Author

Yea, that's true, and right now it's not consistent with other ways we use storages.
This is the format that we'll be using later when we reimplement install_hooks pathing, so the hope is that it will be consistent at that point and they wouldn't need to look anything up.

The dir tree after that change should look like this basically:

.
|-- README.md
|-- dagshub_storage
|-- my_integrated_bucket
`-- src

and the path for the model should be intuitive
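To make the naming convention concrete, here is a self-contained sketch of how a user-supplied path could be split by its first component, in the spirit of the snippet under review. The `Storage` dataclass, the enum member names, and the fallback to a plain repo path are assumptions made for this illustration, not the library's actual types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class StorageType(Enum):
    REPO = "repo"  # hypothetical fallback, not shown in the quoted diff
    DAGSHUB_STORAGE = "dagshub_storage"
    BUCKET = "bucket"


@dataclass
class Storage:
    name: str      # e.g. "my_integrated_bucket"
    protocol: str  # e.g. "s3"


DAGSHUB_STORAGE_PREFIX = "dagshub_storage/"


def classify_path(str_path: str, storages: List[Storage]) -> Tuple[str, StorageType]:
    """Decide where a model path lives from its leading directory name."""
    if str_path.startswith(DAGSHUB_STORAGE_PREFIX):
        # Strip the prefix: the rest is relative to the DagsHub Storage root.
        return str_path[len(DAGSHUB_STORAGE_PREFIX):], StorageType.DAGSHUB_STORAGE
    for storage in storages:
        if str_path.startswith(f"{storage.name}/"):
            # Integrated buckets are addressed as "<protocol>/<bucket name>/<path>".
            return f"{storage.protocol}/{str_path}", StorageType.BUCKET
    # Assumption: anything else is a regular path inside the repository.
    return str_path, StorageType.REPO
```

Under these assumptions, `dagshub_storage/models/bert` maps to `models/bert` in DagsHub Storage, while `my_integrated_bucket/models/bert` maps to `s3/my_integrated_bucket/models/bert` when that bucket uses the `s3` protocol.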

dagshub/models/model_locator.py (resolved thread)
dagshub/models/model_locator.py (outdated, resolved thread)
tests/model_loading/test_download.py (outdated, resolved thread)
@kbolashev kbolashev requested a review from simonlsk March 24, 2024 12:06

@kbolashev kbolashev merged commit 2177bc2 into master Mar 27, 2024
8 checks passed
2 participants