Description
Is your feature request related to a problem? Please describe.
In a new DrWatson project, .gitignore ignores all directories in a project tree labeled data/, plots/, and videos/ by default. This is beneficial to avoid bloating a repository because git doesn't handle large files well, but models and data are tightly knit together, and replicating a project's environment with code, dependencies, data, and visualizations is made complicated by simply excluding all this from the repository.
Describe the solution you'd like
Incorporating DVC would tractably extend git version control to large files. The git remote (e.g. GitHub) does not need to handle the data. Instead, the data would live in a DVC remote (e.g. Google Drive, or a server reached via SSH/SFTP). DVC adds a lightweight metafile to the git repo to be tracked and versioned, and this metafile references the data file's location at its remote. There is one metafile per data file.
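To make the metafile idea concrete, here is a minimal Python sketch of the general pointer-file pattern: hash a data file, write a small metadata file that git can track, and add the data file itself to a local .gitignore. This is only an illustration of the concept, not DVC's implementation or its file format. The function names (`track_file`, `is_unchanged`), the `.meta.json` suffix, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def track_file(data_path: Path) -> Path:
    """Write a small pointer file next to data_path and git-ignore the data.

    Hypothetical helper: the JSON layout and ".meta.json" suffix are
    illustrative only and are not DVC's format.
    """
    data = data_path.read_bytes()
    meta = {
        "path": data_path.name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2) + "\n")

    # Keep the large file out of git; only the pointer file gets committed.
    ignore_path = data_path.parent / ".gitignore"
    existing = ignore_path.read_text().splitlines() if ignore_path.exists() else []
    entry = "/" + data_path.name
    if entry not in existing:
        ignore_path.write_text("\n".join(existing + [entry]) + "\n")
    return meta_path


def is_unchanged(data_path: Path, meta_path: Path) -> bool:
    """Return True if the data file still matches the hash in its pointer file."""
    meta = json.loads(meta_path.read_text())
    return hashlib.sha256(data_path.read_bytes()).hexdigest() == meta["sha256"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "results.csv"
        data_file.write_text("a,b\n1,2\n")
        pointer = track_file(data_file)
        print(pointer.read_text())
        print("unchanged:", is_unchanged(data_file, pointer))
```

Only the small pointer file and the .gitignore would be committed. The data file itself would be copied to whatever remote storage is configured, which is the step this sketch leaves out.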
One way to integrate this with DrWatson would be to specify at project initialization whether the project should be set up with or without DVC. With DVC, a corresponding .dvc/ directory would be generated. A project without DVC would be configured with the current .gitignore that ignores data/, plots/, and videos/ (and .gitattributes, see #254), while a DVC project would have a more inclusive .gitignore. Then, whenever a file is added to DVC tracking, its newly created *.dvc metafile and the .gitignore file containing the name of the actual data file would be tracked with git.
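As a rough illustration of the initialization switch described above, the following Python sketch picks the project-level .gitignore contents based on a flag. It is not DrWatson code (DrWatson is a Julia package). The names `gitignore_lines` and `write_gitignore`, the `use_dvc` flag, and the exact ignore patterns are hypothetical. Creating .dvc/ is deliberately left out, since that directory would come from DVC's own initialization.

```python
import tempfile
from pathlib import Path
from typing import List

# Directories the issue names as ignored by default; the exact patterns
# used here are an assumption for illustration.
BULKY_DIRS = ["data", "plots", "videos"]


def gitignore_lines(use_dvc: bool) -> List[str]:
    """Return project-level .gitignore entries for the chosen setup."""
    if use_dvc:
        # With DVC, each tracked data file gets its own ignore entry when it
        # is added, so the bulky directories stay visible to git and the
        # *.dvc metafiles inside them can be committed.
        return []
    return [f"{name}/" for name in BULKY_DIRS]


def write_gitignore(project_root: Path, use_dvc: bool) -> Path:
    """Write the chosen entries to <project_root>/.gitignore and return its path."""
    path = project_root / ".gitignore"
    lines = gitignore_lines(use_dvc)
    path.write_text("\n".join(lines) + ("\n" if lines else ""))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(repr(write_gitignore(Path(tmp), use_dvc=False).read_text()))
        print(repr(write_gitignore(Path(tmp), use_dvc=True).read_text()))
```

The point of the sketch is the design choice rather than the code: the DVC variant cannot blanket-ignore data/, plots/, and videos/, because the metafiles that git needs to track live inside those directories.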
As a scientific project assistant, versioning data would be a huge help!
Describe alternatives you've considered
DVC's comparison to and integration with other tools and methods. I think git-LFS and git-annex are the two tools closest to DVC, but I've only read some of DVC's material so far. Really, I'm just starting this discussion to see how data tracking might make its way into DrWatson by any means.
Note: no affiliation with DVC or Iterative. I just like their documentation and instructional videos.