-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loading a model from DagsHub repo using get_model_path
#447
Conversation
WDYT: changing the name from |
Fixes: - pass ref into lazy load - lazy load now loads the transformer patches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM,
Very limited remarks and potential bugs.
dagshub/models/model_loaders.py
Outdated
|
||
@property | ||
def model_path(self) -> Path: | ||
return Path(".dagshub") / "storage" / "s3" / self.repo_api.repo_name / self.path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be easy to track down if and when you change the path to which the bucket is located?
Isn't there a utility function to calculate the bucket path in the file system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really easy to track down, and this is a bit of a band-aid before we redo the paths for all of the storage in install_hooks()
.
Right now the model_path is this to be consistent with install_hooks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I meant easy for you to track down the band-aids when you redo the paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember them, and the refactor is supposed to be done soon, so I hope I won't forget.
I'll try to DRY it then, maybe in RepoAPI
dagshub/models/model_locator.py
Outdated
def download_destination(self) -> Path: | ||
if self._download_dest is not None: | ||
return Path(self._download_dest) | ||
return Path(sanitize_filepath(os.path.join(Path.home(), "dagshub", "models", self.repo_api.full_name))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be customizable with an env variable?
Is there already a similar path for dagshub resources? Should this use a common logic with the token cache location for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to reuse the token cache location for this, because it uses kind of an "AppData" directory, that isn't supposed to be used for file storage
These models will be potentially very big, so they could grow in size a lot, so I want it to be very easily discoverable by users.
We're already using ~/dagshub
for data engine files, so it is reusing already existing path in a way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But kind of same comment as before, the different local paths aren't really managed anywhere, and they're spread inline, so I was implying that I expected to see some DRY for how the path is built. This would allow later changes to be less painful. Not critical for now anyway.
dagshub/models/model_locator.py
Outdated
if str_path.startswith("dagshub_storage/"): | ||
return str_path[len("dagshub_storage/"):], StorageType.DagshubStorage | ||
for storage in self.repo_storages: | ||
if str_path.startswith(f"{storage.name}/"): | ||
bucketPath = f"{storage.protocol}/{str_path}" | ||
return bucketPath, StorageType.Bucket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User here needs some advanced knowledge of necessary naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's true, and right now it's not consistent with other ways we use storages.
This is the format that we'll be using later when we reimplement install_hooks pathing, so the hope is that it will be consistent at that point and they wouldn't need to look anything up.
The dir tree after that change should look like this basically:
.
|-- README.md
|-- dagshub_storage
|-- my_integrated_bucket
`-- src
and the path for the model should be intuitive
This pull request adds a function to streamline loading models from a DagsHub Repository:
dagshub.models.get_model_path
.This function tries to find a model in the repository and loads it to a local directory, allowing the user to use it with e.g. transformers:
Order of lookup for the model dir:
.dagshub/model.yaml
file with amodel_dir
keymodel
andmodels
dirs in repo rootmodel
andmodels
dirs in DagsHub Storage rootAlso has
path
andbucket
arguments that allows the user to load from a specific path/bucket.The loading has two download modes:
lazy
andeager
.lazy
runs install_hooks() in the dir with the model, andeager
downloads files into the dir. The resulting paths are consistent between them.The default location for saving is
~/dagshub/models/<user>/<repo>/
, and every model is always stored with its full path in the repository.Unsolved problems: