Proposal for managing test baseline images using data version control (dvc) #5724

Closed · 7 of 9 tasks
maxrjones opened this issue Sep 1, 2021 · 14 comments · Fixed by #6782

Labels: maintenance Boring but important stuff for the core devs

@maxrjones
Member

maxrjones commented Sep 1, 2021

Proposal for managing test baseline images using data version control (dvc)

This issue proposes a solution to #3470 and a partial solution to #2681 by using data version control to manage the baseline images for testing. @weiji14 led an effort to move PyGMT's tests from git version control to data version control with remotes stored on DAGsHub in GenericMappingTools/pygmt#1036; most of the information here is from Wei Ji's posts for PyGMT (thanks! 🙏 🎉 ).

Motivation for migrating baseline images to dvc

Here's the current breakdown for the GMT repository:

  • .git: ~1.1 GB (up from ~720 MB on Feb. 06 2020)
  • test: ~115 MB (101 MB from PS files) (up from ~113 MB on Feb. 06 2020)
  • doc: ~68 MB (51 MB from PS files; 33 MB from PS in doc/examples ; 18 MB from PS in doc/scripts) (down from ~70 MB on Feb. 06 2020)
  • share: ~13.5 MB
  • src: ~16 MB

The overall repository size increased by roughly 50% over the past 1.5 years while the individual directories stayed about the same size, which supports past developer comments that the history growth caused by rewriting PS files is unsustainable.

What is data version control

Data version control (dvc) is an open source tool for managing and versioning datasets and models. It runs on top of Git and uses very similar syntax. Rather than storing bulky images in the repository, small .dvc files containing metadata, including the md5 hash of the data file, are committed to git. This allows versioning of data files that are kept in a remote location. Options for remote storage include Amazon S3, Google Cloud Storage, Azure, an SSH server, and DAGsHub (PyGMT uses DAGsHub).
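For illustration, the .dvc file for a single tracked image is just a few lines of metadata; the hash, size, and filename below are made up:

outs:
- md5: 3f2b9c6e1a0d4e5f8a7b6c5d4e3f2a1b
  size: 132651
  path: psbasemap_01.ps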

Steps required

(Based on PyGMT, may need some updating)

Initial setup (only needs to be done once for the repository)

Installing DVC for developing GMT

Initialize dvc

dvc init                         # creates the .dvcignore file and .dvc/ folder
rm -r .dvc/plots                 # the plots folder won't be used
dvc config core.analytics false  # optionally stop sending anonymous usage data
# git add only the .dvcignore, .dvc/.gitignore, and .dvc/config files
git add .dvcignore .dvc/.gitignore .dvc/config
git commit -m "Initialize data version control"

Setup DVC remote

dvc remote add origin https://dagshub.com/GenericMappingTools/gmt.dvc # updates .dvc/config with the remote URL
dvc remote default origin  # set 'origin' as the default dvc remote
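Pushing to the DAGsHub remote also needs write credentials; a rough sketch (the username and token are placeholders, and --local keeps them out of the committed config):

dvc remote modify origin --local auth basic
dvc remote modify origin --local user <dagshub-username>
dvc remote modify origin --local password <dagshub-token>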

Migrating tests

(based on PyGMT steps, may need updating)

# Sync with git and dvc remotes
git pull
dvc pull
# Generate hash for baseline image and stage the *.dvc file in git
git rm --cached 'test/<test-folder>/<test-image>.ps'
mkdir -p test/baseline/<test-folder>
mv test/<test-folder>/<test-image>.ps test/baseline/<test-folder>/<test-image>.ps
dvc add test/baseline/<test-folder>
git add test/baseline/<test-folder>.dvc test/baseline/.gitignore
# Commit changes and push to both the git and dvc remotes
git commit -m "Migrate test to DVC"
git push
dvc push

Pull images from DVC remote (for GitHub Actions CI and local testing)

dvc status # should report any files 'not_in_cache'
dvc pull # pull down files from DVC remote cache (fetch + checkout)
cd <build-dir>
ctest

What about the images for documentation?

The test directory is currently much larger than the documentation directory, so migrating the tests will be a large first step that does not require an established solution for the documentation images. Regardless, my opinion is that we should host the examples/tutorials/animations in a separate repository (#5364 (comment)).

References

Are you willing to help implement and maintain this feature? Yes

@maxrjones maxrjones added the maintenance Boring but important stuff for the core devs label Sep 1, 2021
@maxrjones maxrjones self-assigned this Sep 7, 2021
@maxrjones
Member Author

Based on the discussion at the last community meeting, I will start the migration of the baseline images to dvc using DAGsHub for storage.

@weiji14
Member

weiji14 commented Sep 7, 2021

Cool, let me know if you need any help 😀

Just a note on storage limits. According to https://dagshub.com/plans, DAGsHub provides up to 10 GB of free space. So I think <200MB from GMT is ok for now (PyGMT probably has <15MB on DAGsHub), but just something to keep in mind when uploading those large PS and video files.

[screenshot of the DAGsHub storage plans]

@joa-quim
Member

joa-quim commented Sep 8, 2021

What does it mean (for us)?
[screenshot of a DAGsHub storage limit]

@maxrjones
Member Author

What does it mean (for us)?
[screenshot of a DAGsHub storage limit]

Good question. I just checked on their community forum and that limit only applies to private projects, so it does not mean anything for us.

@joa-quim
Member

joa-quim commented Sep 8, 2021

Good, thanks.

@maxrjones
Member Author

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files). I am going to try restructuring the tests so that, rather than having the .ps files paired with the .sh files (test/**/*.ps), there is a single test/baseline/ directory that can be dvc added. I'll test this out in my fork of the gmt repository.
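A rough, untested sketch of that restructuring (assuming the baselines currently sit next to the scripts as test/<test-folder>/<test-image>.ps):

for ps in test/*/*.ps; do
    folder=$(basename "$(dirname "$ps")")
    mkdir -p "test/baseline/$folder"
    git rm --cached "$ps"
    mv "$ps" "test/baseline/$folder/"
done
dvc add test/baseline          # writes a single test/baseline.dvc file
git add test/baseline.dvc test/.gitignore
git commit -m "Move baseline images into test/baseline and track with dvc"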

Since the DAGsHub interface supports viewing png files, I am also going to research whether using .png files rather than .ps files will impact performance.

@weiji14
Member

weiji14 commented Sep 11, 2021

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files).

Does tracking a directory mean computing a hash for the whole directory? A bit concerned with what this means if different people are trying to modify different PS files on multiple branches.

Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running dvc add on a directory will indeed produce a single test/baseline/directory.dvc file with a single md5 hash. This directory.dvc file might be a source of multiple merge conflicts.

@maxrjones
Member Author

@PaulWessel, for PyGMT we bundle up the test images at release time and include them as an asset for the GitHub and Zenodo releases. Do you think this is desirable for GMT as well?

Benefits:

  • provides a way to completely reproduce the git/dvc history in the event that the DAGsHub server is lost
  • possibly easier/quicker for comparing past test results than checking out older git commits, rebuilding, and re-running the tests (in the event that the DAGsHub information is lost)

Downsides:

  • requires some work to set up (probably not too difficult since we can imitate PyGMT's setup)
  • may distract from the more important release assets

@joa-quim
Member

There are 104 MB of files in test\baseline; is this what you are referring to?

@maxrjones
Member Author

There are 104 MB of files in test\baseline; is this what you are referring to?

Yes. Not to go in the source tarballs, bundles, or Windows installers. Just zip that up when we do a release and archive it somewhere.
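A minimal sketch of that step (the archive name and destination are placeholders):

dvc pull                                  # make sure all baseline images are present locally
tar -czf gmt-tests-baseline-X.Y.Z.tar.gz test/baseline
# attach the tarball to the GitHub release and the Zenodo deposit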

@joa-quim
Member

Right, backups are never a bad idea. The PS files should compress significantly.

@PaulWessel
Member

I agree, good thing to do for self-preservation.

@seisman
Member

seisman commented Oct 28, 2023

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files).

Does tracking a directory mean computing a hash for the whole directory? A bit concerned with what this means if different people are trying to modify different PS files on multiple branches.

Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running dvc add on a directory will indeed produce a single test/baseline/directory.dvc file with a single md5 hash. This directory.dvc file might be a source of multiple merge conflicts.

Tracking directories has caused a lot of trouble for us recently. For example, all the PS files of the 52 examples are DVC-tracked in a single DVC file (i.e., doc/examples/images.dvc). Currently, its content is:

outs:
- md5: 4dd0ad31844cb0b0b451648cda314e2a.dir
  size: 37295153
  nfiles: 53
  path: images
  hash: md5

The problems are:

  • When we update any PS file, we see the md5, size, and/or nfiles change, but we don't know which file actually changed (e.g., update ex15 PS and dvc .git file #7978), even though we have the dvc-diff workflow (which sometimes works and sometimes doesn't).
  • When two PRs update different examples, the doc/examples/images.dvc file changes in both PRs. If one PR is merged, the other PR will see a conflict in doc/examples/images.dvc, which is difficult to resolve (e.g., Convert test script for segmentizing to doc plot #7967).

So, tracking directories is not a good choice for us.
As I understand @maxrjones's comment #5724 (comment), tracking a large number of files is not efficient, but are 1000 files too many? After reading some upstream DVC issues, it seems they are talking about inefficiency for 10k to millions of files. The number of GMT's PS files will definitely increase, but I don't think we will have more than 2000 files in the next few years. So maybe we should try tracking individual files instead?
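For reference, a minimal sketch of what per-file tracking could look like (the example filename is illustrative, and it assumes the directory-level doc/examples/images.dvc has first been removed with dvc remove):

dvc add doc/examples/images/ex15.ps    # writes doc/examples/images/ex15.ps.dvc
git add doc/examples/images/ex15.ps.dvc doc/examples/images/.gitignore
git commit -m "Track ex15 baseline image individually with dvc"
dvc push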

@seisman seisman reopened this Oct 28, 2023
@maxrjones
Member Author

As I understand @maxrjones's comment #5724 (comment), tracking a large number of files is not efficient, but are 1000 files too many? After reading some upstream DVC issues, it seems they are talking about inefficiency for 10k to millions of files. The number of GMT's PS files will definitely increase, but I don't think we will have more than 2000 files in the next few years. So maybe we should try tracking individual files instead?

As the .dvc files are so small and we can always purge images from the DAGsHub repo, the only real risk here seems to be the amount of time it would take to try this and go back if necessary. I would guess it would take a couple of hours of work to go from the current structure to tracking individual files, and likely about the same to go back if it turns out to be more of a headache. Seems worth trying IMO given the recent frustrations.
