Add file datatype type to support saving and reading files/folders in artifact_store #1805

Merged
merged 6 commits into from
Feb 23, 2024

Conversation

Collaborator

@jieguangzhou jieguangzhou commented Feb 21, 2024

Description

#1789

Related Issues

Checklist

  • Is this code covered by new or existing unit tests or integration tests?
  • Did you run make unit-testing and make integration-testing successfully?
  • Do new classes, functions, methods and parameters all have docstrings?
  • Were existing docstrings updated, if necessary?
  • Was external documentation updated, if necessary?

Additional Notes or Comments

@jieguangzhou jieguangzhou changed the title Add file datatype type to support saving and reading files/folders in… Add file datatype type to support saving and reading files/folders in artifact_store Feb 21, 2024
@codecov-commenter

codecov-commenter commented Feb 21, 2024

Codecov Report

Attention: Patch coverage is 65.21739%, with 64 lines in your changes missing coverage. Please review.

Project coverage is 65.91%. Comparing base (34830a7) to head (61ced9a).
Report is 1467 commits behind head on main.

Files                                         Patch %   Lines
superduperdb/backends/mongodb/artifacts.py    15.38%    55 Missing ⚠️
superduperdb/backends/base/artifact.py        82.75%    5 Missing ⚠️
superduperdb/components/datatype.py           91.48%    4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1805       +/-   ##
===========================================
- Coverage   80.33%   65.91%   -14.43%     
===========================================
  Files          95      124       +29     
  Lines        6602     8842     +2240     
===========================================
+ Hits         5304     5828      +524     
- Misses       1298     3014     +1716     


@jieguangzhou jieguangzhou added this to the #1789 milestone Feb 21, 2024
@jieguangzhou jieguangzhou marked this pull request as draft February 21, 2024 15:28
@jieguangzhou jieguangzhou self-assigned this Feb 21, 2024
@jieguangzhou jieguangzhou marked this pull request as ready for review February 21, 2024 15:33
@blythed blythed removed this from the #1789 milestone Feb 22, 2024
@jieguangzhou jieguangzhou force-pushed the feat/save-directory branch 5 times, most recently from 7f7c213 to 78ebe53 Compare February 22, 2024 16:12
@jieguangzhou jieguangzhou force-pushed the feat/save-directory branch 2 times, most recently from 34609c5 to 54e0bd3 Compare February 23, 2024 07:06
Collaborator

@blythed blythed left a comment

Looks good - need to sync on plan for various Encodable versions.

superduperdb/backends/base/artifact.py (outdated, resolved)
superduperdb/backends/mongodb/artifacts.py (outdated, resolved)
superduperdb/components/datatype.py (outdated, resolved)
test/integration/artifacts/test_mongodb.py (outdated, resolved)
superduperdb/base/document.py (outdated, resolved)
test/unittest/backends/local/test_artifacts.py (outdated, resolved)
superduperdb/backends/base/artifact.py (outdated, resolved)
superduperdb/base/datalayer.py (resolved)
save_path = os.path.join(file_id_folder, name)
logging.info(f"Copying file {file_path} to {save_path}")
if path.is_dir():
    shutil.copytree(file_path, save_path)
Collaborator

@jieguangzhou if the copy only partially completes and the program crashes, should there be a rollback?
Possibly in a future PR, if not in this one.

Ignore this if shutil.copy* already handles it.

Collaborator Author

Yes, we need to do this in the future. It applies not only to the filesystem: every artifact store needs to support this.

Collaborator

This should be handled in db.add. We need to be able to roll back the whole add process.

Collaborator

  • Deleting already added artifacts
  • Cancelling any running jobs
  • Removing computed outputs

Not an easy task.
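
As a rough illustration of the filesystem part of this thread only (not the db.add-level rollback, and not what this PR implements), a copy can be staged in a temporary directory and moved into place once it is complete. The names copy_with_rollback and the ".staging-" prefix are hypothetical, and the sketch assumes the destination does not exist yet.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_with_rollback(file_path: str, save_path: str) -> str:
    """Copy a file or directory to save_path without leaving a partial copy there.

    Hypothetical helper for illustration; assumes save_path does not exist yet.
    """
    src = Path(file_path)
    dest = Path(save_path)
    dest.parent.mkdir(parents=True, exist_ok=True)

    # Stage next to the destination so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(dir=dest.parent, prefix=".staging-"))
    staged = staging / dest.name
    try:
        if src.is_dir():
            shutil.copytree(src, staged)
        else:
            shutil.copy2(src, staged)
        # Publish the finished copy in a single rename step.
        os.replace(staged, dest)
    finally:
        # Runs on success and on failure: drop whatever is left in staging.
        shutil.rmtree(staging, ignore_errors=True)
    return str(dest)
```

If the process is killed outright, a ".staging-*" directory may be left behind for later cleanup, but save_path itself never holds a half-written copy. Cancelling jobs and removing computed outputs, as listed above, would still need handling at a higher level.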

save_path = os.path.join(file_id_folder, name)
logging.info(f"Copying file {file_path} to {save_path}")
if path.is_dir():
    shutil.copytree(file_path, save_path)
Collaborator

@jieguangzhou @blythed
Since we copy the directory into the artifact store as a snapshot, we need to caution users that changes made to the source directory after db.load will not be reflected in the component, although this is the intended behavior.
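
To make the snapshot behavior concrete, here is a small standalone sketch using only shutil.copytree on temporary directories. It does not touch the SuperDuperDB API; the directory and file names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# A source directory the user owns.
source = workdir / "source"
source.mkdir()
(source / "config.txt").write_text("version 1")

# Stand-in for the artifact store: copytree takes a point-in-time copy.
snapshot = workdir / "artifact_store" / "source"
shutil.copytree(source, snapshot)

# The user edits the source after the copy was taken.
(source / "config.txt").write_text("version 2")

# The stored copy still holds the old content.
snapshot_content = (snapshot / "config.txt").read_text()
source_content = (source / "config.txt").read_text()
print(snapshot_content, "|", source_content)
```

The stored copy keeps the content from the time of the copy, so a user who wants later edits to appear would need to add the directory again.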

superduperdb/backends/mongodb/artifacts.py (outdated, resolved)
"""Download file or folder from GridFS and return the path"""
temp_dir = tempfile.mkdtemp(prefix=file_id)

# try to download a file first, if it fails, assume it's a folder
Collaborator

This comment is no longer valid, right? You now have a metadata.type of 'dir' or 'file', which tells you whether it is a folder or a file.

Collaborator Author

When downloading, we currently have only the file_id and no other information. We could look up the type by file_id first, but that adds an extra query.

Alternatively, we could pass more information from upstream, but that increases the complexity.

WDYT?

Collaborator

It should be explicit. If we don't have the type then we should raise an Exception. Don't assume anything.

Collaborator Author

Done
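
The explicit check requested in this thread could look roughly like the sketch below. It is not the code that was merged: resolve_artifact_type is a hypothetical helper, and the metadata is modeled as a plain mapping carrying the 'file' / 'dir' type mentioned above.

```python
from typing import Mapping, Optional

# The two values discussed in the review thread.
VALID_TYPES = ("file", "dir")


def resolve_artifact_type(metadata: Optional[Mapping[str, str]]) -> str:
    """Return 'file' or 'dir' from stored metadata, or raise instead of guessing.

    Hypothetical helper for illustration only.
    """
    if not metadata or "type" not in metadata:
        raise ValueError("artifact metadata has no 'type'; refusing to guess")
    kind = metadata["type"]
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown artifact type {kind!r}; expected one of {VALID_TYPES}")
    return kind
```

A download routine would then branch on the returned value rather than trying a file download first and falling back to a folder.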

@jieguangzhou jieguangzhou force-pushed the feat/save-directory branch 3 times, most recently from c0fdf02 to a065b06 Compare February 23, 2024 11:28
@blythed blythed merged commit d4179b6 into SuperDuperDB:main Feb 23, 2024
3 checks passed
4 participants