This repository demonstrates Data Version Control (DVC) for managing ML pipelines and versioning data.
DVC is an open-source version control system for machine learning projects. It helps you:
- Version control large files, data sets, machine learning models, and metrics
- Track ML experiments
- Create reproducible ML pipelines
- Collaborate with team members
.
├── data/ # Raw and processed data files
│ └── raw.dvc # DVC pointer file that tracks the raw data
├── src/ # Source code for data processing and model training
├── config/ # Configuration files
├── .dvc/ # DVC internal files
├── dvc.yaml # DVC pipeline definition
├── dvc.lock # DVC lock file recording the hashes of stage dependencies and outputs, for reproducible pipelines
└── .dvcignore # Files/directories to be ignored by DVC
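The `dvc.yaml` file in the layout above holds the pipeline stages. As a minimal sketch, a stage could be registered with `dvc stage add`, which writes the corresponding entry into `dvc.yaml`. The stage name `prepare`, the script `src/prepare.py`, and the output `data/processed` are hypothetical placeholders and are not taken from this repository. The dependency `data/raw` is assumed from the `raw.dvc` file shown above.

```shell
# Sketch: register one pipeline stage in dvc.yaml.
# Stage name, script path, and output path are hypothetical placeholders.
add_prepare_stage() {
    # -n: stage name, -d: dependency, -o: output; the rest is the stage command
    dvc stage add \
        -n prepare \
        -d src/prepare.py \
        -d data/raw \
        -o data/processed \
        python src/prepare.py data/raw data/processed
}

# Usage (from the repository root):
#   add_prepare_stage && dvc repro
```

Listing the script itself as a dependency means that editing the script marks the stage as changed, so the next `dvc repro` reruns it.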
- Install project dependencies using uv:
uv sync
- Pull the data from remote storage:
dvc pull
- Run the pipeline to reproduce all stages:
dvc repro
- Track data files:
dvc add <file>
- Push data to remote storage:
dvc push
- Pull data from remote storage:
dvc pull
- Check which stages or tracked data have changed:
dvc status
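These commands are usually combined with Git: DVC stores the data itself, while Git versions the small `.dvc` pointer files. The following is a minimal sketch of that workflow, assuming a Git repository with DVC initialized and a default remote already configured. The function name `track_and_push` and the example path are illustrative only.

```shell
# Sketch: version a data file with DVC and record its pointer file in Git.
# Assumes DVC is initialized and a default remote is configured.
track_and_push() {
    target="$1"                              # data file or directory, no trailing slash
    message="${2:-Track $target with DVC}"   # commit message, with a default

    dvc add "$target" || return 1            # writes <target>.dvc and updates .gitignore
    git add "$target.dvc" "$(dirname "$target")/.gitignore" || return 1
    git commit -m "$message" || return 1
    dvc push                                 # upload the cached data to remote storage
}

# Usage:
#   track_and_push data/raw "Add raw dataset"
```

Committing the pointer file before pushing keeps the Git history and the remote data in step, so a later `git checkout` followed by `dvc pull` can restore the matching version of the data.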