Proposal for managing test baseline images using data version control (dvc) #5724

Closed · 7 of 9 tasks
maxrjones opened this issue Sep 1, 2021 · 14 comments · Fixed by #6782

Labels: maintenance Boring but important stuff for the core devs

@maxrjones
Member

maxrjones commented Sep 1, 2021

Proposal for managing test baseline images using data version control (dvc)

This issue proposes a solution to #3470 and a partial solution to #2681 by using data version control to manage the baseline images for testing. @weiji14 led an effort to move PyGMT's tests from git version control to data version control with remotes stored on DAGsHub in GenericMappingTools/pygmt#1036; most of the information here is from Wei Ji's posts for PyGMT (thanks! 🙏 🎉 ).

Motivation for migrating baseline images to dvc

Here's the current breakdown for the GMT repository:

  • .git: ~1.1 GB (up from ~720 MB on Feb. 06 2020)
  • test: ~115 MB (101 MB from PS files) (up from ~113 MB on Feb. 06 2020)
  • doc: ~68 MB (51 MB from PS files; 33 MB from PS in doc/examples ; 18 MB from PS in doc/scripts) (down from ~70 MB on Feb. 06 2020)
  • share: ~13.5 MB
  • src: ~16 MB

The overall repository size increased by roughly 50% over the past 1.5 years while the individual directories stayed about the same size, which supports past developer comments that the history growth caused by rewriting PS files is unsustainable.

What is data version control

Data version control (dvc) is an open source tool for managing and versioning datasets and models. It runs on top of Git and uses very similar syntax. Rather than storing bulky images in the repository, small .dvc files containing metadata, including the md5 hash of the data file, are committed to git. This allows versioning of data files that are kept in a remote location. Options for remote storage include Amazon S3, Google Cloud Storage, Azure, an SSH server, and DAGsHub (PyGMT uses DAGsHub).
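For illustration, the .dvc file for a single tracked image is just a few lines of metadata; the hash, size, and filename below are made up:

outs:
- md5: 3f2b9c6e1a0d4e5f8a7b6c5d4e3f2a1b
  size: 132651
  path: psbasemap_01.ps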

Steps required

(Based on PyGMT, may need some updating)

Initial setup (only needs to be done once for the repository)

Installing DVC for developing GMT

Initialize dvc

dvc init                         # creates the .dvcignore file and .dvc/ folder
rm -r .dvc/plots                 # the plots folder won't be used
dvc config core.analytics false  # optionally stop sending anonymous usage data
# git add only the .dvcignore, .dvc/.gitignore, and .dvc/config files
git add .dvcignore .dvc/.gitignore .dvc/config
git commit -m "Initialize data version control"

Setup DVC remote

dvc remote add origin https://dagshub.com/GenericMappingTools/gmt.dvc # updates .dvc/config with the remote URL
dvc remote default origin  # set 'origin' as the default dvc remote
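Pushing to the DAGsHub remote also needs write credentials; a rough sketch (the username and token are placeholders, and --local keeps them out of the committed config):

dvc remote modify origin --local auth basic
dvc remote modify origin --local user <dagshub-username>
dvc remote modify origin --local password <dagshub-token>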

Migrating tests

(based on PyGMT steps, may need updating)

# Sync with git and dvc remotes
git pull
dvc pull
# Generate hash for baseline image and stage the *.dvc file in git
git rm --cached 'test/<test-folder>/<test-image>.ps'
mkdir -p test/baseline/<test-folder>
mv test/<test-folder>/<test-image>.ps test/baseline/<test-folder>/<test-image>.ps
dvc add test/baseline/<test-folder>
git add test/baseline/<test-folder>.dvc test/baseline/.gitignore
# Commit changes and push to both the git and dvc remotes
git commit -m "Migrate test to DVC"
git push
dvc push

Pull images from DVC remote (for GitHub Actions CI and local testing)

dvc status # should report any files 'not_in_cache'
dvc pull # pull down files from DVC remote cache (fetch + checkout)
cd <build-dir>
ctest

What about the images for documentation?

The test directory is currently much larger than the documentation directory, so migrating the tests will be a large first step that does not require an established solution for the documentation images. Regardless, my opinion is that we should host the examples/tutorials/animations in a separate repository (#5364 (comment)).

References

Are you willing to help implement and maintain this feature? Yes

@maxrjones maxrjones added the maintenance Boring but important stuff for the core devs label Sep 1, 2021
@maxrjones maxrjones self-assigned this Sep 7, 2021
@maxrjones
Member Author

Based on the discussion at the last community meeting, I will start the migration of the baseline images to dvc using DAGsHub for storage.

@weiji14
Member

weiji14 commented Sep 7, 2021

Cool, let me know if you need any help 😀

Just a note on storage limits. According to https://dagshub.com/plans, DAGsHub provides up to 10 GB of free space. So I think <200MB from GMT is ok for now (PyGMT probably has <15MB on DAGsHub), but just something to keep in mind when uploading those large PS and video files.

[screenshot of the DAGsHub storage plans]

@joa-quim
Member

joa-quim commented Sep 8, 2021

What does it mean (for us)?
[screenshot of a DAGsHub storage limit]

@maxrjones
Member Author

What does it mean (for us)?
[screenshot of a DAGsHub storage limit]

Good question. I just checked on their community forum and that limit only applies to private projects, so it does not mean anything for us.

@joa-quim
Member

joa-quim commented Sep 8, 2021

Good, thanks.

@maxrjones
Member Author

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files). I am going to try restructuring the tests so that, rather than having the .ps files paired with the .sh files (test/**/*.ps), there is a single test/baseline/ directory that can be dvc added. I'll test this out in my fork of the gmt repository.
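A rough, untested sketch of that restructuring (assuming the baselines currently sit next to the scripts as test/<test-folder>/<test-image>.ps):

for ps in test/*/*.ps; do
    folder=$(basename "$(dirname "$ps")")
    mkdir -p "test/baseline/$folder"
    git rm --cached "$ps"
    mv "$ps" "test/baseline/$folder/"
done
dvc add test/baseline          # writes a single test/baseline.dvc file
git add test/baseline.dvc test/.gitignore
git commit -m "Move baseline images into test/baseline and track with dvc"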

Since the DAGsHub interface supports viewing png files, I am also going to research whether using .png files rather than .ps files will impact performance.

@weiji14
Member

weiji14 commented Sep 11, 2021

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files).

Does tracking a directory mean computing a hash for the whole directory? A bit concerned with what this means if different people are trying to modify different PS files on multiple branches.

Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running dvc add on a directory will indeed produce a single test/baseline/directory.dvc file with a single md5 hash. This directory.dvc file might be a source of multiple merge conflicts.

@maxrjones
Member Author

@PaulWessel, for PyGMT we bundle up the test images at release time and include them as an asset for the GitHub and Zenodo releases. Do you think this is desirable for GMT as well?

Benefits:

  • provides a way to completely reproduce the git/dvc history in the event that the DAGsHub server is lost
  • possibly easier/quicker for comparing past test results than checking out older git commits, rebuilding, and re-running the tests (in the event that the DAGsHub information is lost)

Downsides:

  • requires some work to set up (probably not too difficult since we can imitate PyGMT's setup)
  • may distract from the more important release assets

@joa-quim
Member

There are 104 MB of files in test\baseline; is this what you are referring to?

@maxrjones
Member Author

There are 104 MB of files in test\baseline; is this what you are referring to?

Yes. Not to go in the source tarballs, bundles, or Windows installers. Just zip that up when we do a release and archive it somewhere.
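A minimal sketch of that step (the archive name and destination are placeholders):

dvc pull                                  # make sure all baseline images are present locally
tar -czf gmt-tests-baseline-X.Y.Z.tar.gz test/baseline
# attach the tarball to the GitHub release and the Zenodo deposit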

@joa-quim
Member

Right, backups are never a bad idea. The PS files should compress significantly.

@PaulWessel
Member

I agree, good thing to do for self-preservation.

@seisman
Member

seisman commented Oct 28, 2023

As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files).

Does tracking a directory mean computing a hash for the whole directory? A bit concerned with what this means if different people are trying to modify different PS files on multiple branches.

Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running dvc add on a directory will indeed produce a single test/baseline/directory.dvc file with a single md5 hash. This directory.dvc file might be a source of multiple merge conflicts.

Tracking directories has caused a lot of trouble for us recently. For example, all the PS files of the 52 examples are DVC-tracked in a single DVC file (i.e., doc/examples/images.dvc). Currently, its content is:

outs:
- md5: 4dd0ad31844cb0b0b451648cda314e2a.dir
  size: 37295153
  nfiles: 53
  path: images
  hash: md5

The problems are:

  • When we update any PS file, we see the md5, size, and/or nfiles change, but we don't know which file actually changed (e.g., update ex15 PS and dvc .git file #7978), even though we have the dvc-diff workflow (which sometimes works and sometimes doesn't).
  • When two PRs update different examples, the doc/examples/images.dvc file changes in both PRs. If one PR is merged, the other PR will see a conflict in doc/examples/images.dvc, which is difficult to resolve (e.g., Convert test script for segmentizing to doc plot #7967).

So, tracking directories is not a good choice for us.
As I understand @maxrjones's comment #5724 (comment), tracking a large number of files is not efficient, but are 1000 files too many? After reading some upstream DVC issues, it seems they are talking about inefficiency for 10k to millions of files. The number of GMT's PS files will definitely increase, but I don't think we will have more than 2000 files in the next few years. So maybe we should try tracking individual files instead?
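For reference, a minimal sketch of what per-file tracking could look like (the example filename is illustrative, and it assumes the directory-level doc/examples/images.dvc has first been removed with dvc remove):

dvc add doc/examples/images/ex15.ps    # writes doc/examples/images/ex15.ps.dvc
git add doc/examples/images/ex15.ps.dvc doc/examples/images/.gitignore
git commit -m "Track ex15 baseline image individually with dvc"
dvc push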

@seisman seisman reopened this Oct 28, 2023
@maxrjones
Member Author

As I understand @maxrjones's comment #5724 (comment), tracking a large number of files is not efficient, but are 1000 files too many? After reading some upstream DVC issues, it seems they are talking about inefficiency for 10k to millions of files. The number of GMT's PS files will definitely increase, but I don't think we will have more than 2000 files in the next few years. So maybe we should try tracking individual files instead?

As the .dvc files are so small and we can always purge images from the DAGsHub repo, the only real risk here seems to be the amount of time it would take to try this and go back if necessary. I would guess it would take a couple of hours of work to go from the current structure to tracking individual files, and likely about the same to go back if it turns out to be more of a headache. Seems worth trying IMO given the recent frustrations.
