# How to Version Gigabyte-Sized Datasets Just Like Code With DVC in Python
# A Complete Tutorial to Data Version Control With DVC in Python

![](images/pexels.jpg)

### The big problem in data science

When a dataset is large, it creates an even larger mess. Why? Data scientists and ML engineers perform many experiments that involve massive datasets and models. These experiments create huge headaches in terms of collaboration and software engineering best practices.

In a traditional setting, software engineers collaborate by making copies of the central codebase and suggesting changes to it via pull requests. Then, their changes are reviewed, tested and merged into the main codebase if approved. This process can happen multiple times in a single day.

Tools like Git have matured for almost two decades, making the above process a breeze for programmers. But Git is designed for lightweight code scripts, not the hundreds of thousands of images we use to train costly CNNs.

Yes, there are alternatives like Git LFS, but it is too much of a hassle to set up, and it doesn't allow safe branching, committing and experimentation on large files, which are must-have features.

For this reason, there are now many open-source (or paid) tools to solve these headaches. One of them is DVC (Data Version Control).

### What is data version control and DVC?

Data version control is a process of tracking and versioning changes made to data and models in data projects. A good data version control system must have the following features:

1. Track changes made to data and models the way Git handles scripts.
2. Ease of setup and use. Ideally, you should be able to install it with a single command.
3. Compatible with existing systems like Git, so that it doesn't reinvent the wheel.
4. Support for branching and committing. The system should support the ability to create branches and commits, allowing for safe experimentation and easy collaboration.
5. Reproducibility: The system should enable easy reproduction of experiments, allowing other users to quickly and easily reproduce results.
6. Sharing capabilities: The system should support the ability to easily share data and models with other users, allowing for collaboration and sharing of results.

One tool that has all of the above features is DVC. It integrates seamlessly with Git and imitates its features but for large files.

While Git stores a codebase on hosting services like GitHub or GitLab, DVC uses remote storages to upload data and models. A remote storage can be any cloud provider like AWS, GCP, Azure or even a plain-old directory on your local machine. A remote will be a single source of truth for the whole project, used by all team members. 

When a file is tracked by DVC, a lightweight file with the `.dvc` extension is created. This file serves as a placeholder for the original large file and contains the information DVC needs to download it from the remote.
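To make the placeholder idea concrete, here is a minimal sketch of what such a file could hold: a content hash, a size and a path. The function names are made up for illustration, and the output is a simplified imitation of a `.dvc` file rather than the exact format DVC writes (real files can carry more fields, and directories are handled differently).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def placeholder_text(path):
    """Build a simplified, .dvc-like placeholder for one file (illustrative only)."""
    path = Path(path)
    return (
        "outs:\n"
        f"- md5: {md5_of_file(path)}\n"
        f"  size: {path.stat().st_size}\n"
        f"  path: {path.name}\n"
    )
```

The placeholder is only a few lines long no matter how large the original file is, which is why it is cheap to keep in Git.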

### What will you learn in the tutorial?

By completing this tutorial, you will have a GitHub repository for an image classification project. Other people will be able to get all your code, data, models and experiments with only two commands:

```
$ git clone https://github.com/username/repo.git
$ cd repo
$ dvc pull # Get all the data and models
```

The article will teach you everything needed, so you can run the `dvc pull` command and understand almost everything that goes under the hood. Let's jump right in!

### Setting up the project and environment

Let's get started by creating the `conda` environment we will be working in.

```
conda create -n traffic_signs_recognition python=3.9 -y

conda activate traffic_signs_recognition
```

Next, we clone the following repo and change into the working directory. Alternatively, you can create the working directory yourself and initialize `git` with `git init`.

```
git clone https://github.com/BexTuychiev/traffic_signs_recognition.git
cd traffic_signs_recognition
```

Let's first create the `requirements.txt` file with a few dependencies and install them. Run the following commands:

```
$ echo -e "tensorflow\nscikit-learn\nnumpy\npandas\nmatplotlib\nseaborn\nscikit-image\ndvc" >> requirements.txt
$ cat requirements.txt
tensorflow
scikit-learn
numpy
pandas
matplotlib
seaborn
scikit-image
dvc
```

> Running the `echo` command with the `-e` flag makes it interpret escape sequences like line breaks (`\n`).

The file lists a few standard data libraries along with `scikit-image` for image manipulation and `tensorflow` for building the models. You can install everything with `pip install -r requirements.txt`. The last entry is `dvc`, which is the main focus of the article.

Now, let's build the tree structure of our project:

```bash
$ mkdir data notebooks src data/raw data/prepared data/prepared/train
```

We will store the scripts inside `src`, while `data` and `notebooks` will hold the images and analysis notebooks we might create later.

### Download and set up the data

Now, we will download the dataset for the project. The GTSRB - German Traffic Sign Recognition Benchmark dataset contains more than 50k images divided into 43 road sign categories. Our task is to build a convolutional neural network that can accurately classify each category.

You can go to the [dataset page](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign) or download it directly using [this link](https://storage.googleapis.com/kaggle-data-sets/82373/191501/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20221210%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221210T130850Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=65eeae3c577195c0b9185b9e37ab185a3e5cc8c990a501390621201196cfd2e5ecbb0952db6bc443a09d08e252744472705c7bc90caa2c82aaa699b7d24f5592075046a771f05e424bb0d7fc6e8f8bff4e04e25a5e4e2b2e816a966e25df023050344400b97e676d9d0ac0c93c9046a007d74db740d311822fd79ea6bbdfa4d6459de2b2b061ca5187d2bf83c284feef39b06296cf4f46c7bc6f95c6488d7ea78a4eaf28ea43e7f8ef0afd97805d0943782b99377fd35a9e8781f17419d2fff43d66822d56c11802f209822dd86ba4e64edd7800d3125a7cff88b5616fbd3ddc0f2f3dfea2f86325cd185fc88cb5e46d517a846d407d4b6637df713cd8a36c36) and the below commands:

```
$ curl "the_link_inside_quotes" -o data/traffic_signs.zip
```

Once the download is done, unzip the images into the `data/raw` directory. Then, we can remove the unnecessary files and directories like duplicates of the images and metadata. This will leave us only with the `train` and `test` folders inside `data/raw`.

```bash
$ unzip data/traffic_signs.zip -d data/raw
$ cd data/raw
$ rm -rf Train Test Meta meta Meta.csv Test.csv Train.csv
$ rm test/GT-final_test.csv
$ cd ../..
$ rm data/traffic_signs.zip
```

The last command removes the downloaded zip archive as well.

The `train` folder has 43 folders, one for each class. Keep this directory structure in mind, as we will use it when training a model.
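As a quick sanity check of that structure, a small helper like the one below can count the class folders and the images in each. This is only a sketch: the function name is made up, and it assumes the images are `.png` files stored one level below `data/raw/train`.

```python
from pathlib import Path


def count_images_per_class(train_dir, pattern="*.png"):
    """Map each class folder name to the number of images it contains."""
    train_dir = Path(train_dir)
    return {
        class_dir.name: sum(1 for _ in class_dir.glob(pattern))
        for class_dir in sorted(train_dir.iterdir())
        if class_dir.is_dir()  # skip stray files next to the class folders
    }


# Example usage from the project root:
# counts = count_images_per_class("data/raw/train")
# print(len(counts), "classes,", sum(counts.values()), "images")
```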

### Initializing DVC

Having a dataset is already enough to start working with `dvc`. In this section, you will see the basics of how Git and DVC work together. 

To add DVC tracking to our project, we call `dvc init`, much like `git init`. DVC only works on top of Git repositories. The `init` command adds a special `.dvc` directory that holds the DVC configuration. After running `dvc init`, `git status` shows the new files:

```
$ git status -s
A  .dvc/.gitignore
A  .dvc/config
A  .dvcignore
```

The command also creates a `.dvcignore` file that can be used to list files and directories that DVC should ignore. For now, we will leave it empty.

Once DVC is initialized, it needs a place called a remote storage to upload data and large files so that they aren't tracked by Git. A DVC remote can be any cloud storage provider like AWS, Azure, GCP or just another directory on your machine.

For simplicity, we will set the remote storage for this project to a new directory called `dvc_remote` in the home directory.

```
$ mkdir ~/dvc_remote
$ dvc remote add -d remote ~/dvc_remote
```

The `remote` command is used to work with remote storages. Here, we are naming our remote storage simply `remote`. The `-d` flag tells DVC to use it as the default remote, and `~/dvc_remote` is the path where it lives.

Once you run these commands, you can look at the `config` file inside `.dvc` folder:

```bash
$ cat .dvc/config
[core]
    remote = remote
['remote "remote"']
    url = /home/bexgboost/dvc_remote/
```

As you can see, the remote name is listed as `remote` and the `url` is set to a path in my own home directory. If our remote were cloud-based, it would be a storage URL instead (for example, something like `s3://bucket/path`).
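If you ever need to read this file from a script, the layout shown above happens to be simple enough for Python's built-in `configparser`. The sketch below is an assumption-laden shortcut, not how DVC itself reads its configuration: it relies on the exact section naming printed above, and `default_remote_url` is a hypothetical helper name.

```python
import configparser


def default_remote_url(config_text):
    """Return (name, url) of the default remote from .dvc/config-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # Section headers look like ['remote "NAME"'], so the parsed section
    # name keeps the surrounding single quotes.
    section = f"'remote \"{name}\"'"
    return name, parser[section]["url"]


# Example usage from the project root:
# from pathlib import Path
# print(default_remote_url(Path(".dvc/config").read_text()))
```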

#### Adding files to track with DVC

Before tracking any data, let's commit the files created by `dvc init` along with the remote configuration:

```bash
$ git add --all
$ git commit -m "Initialize DVC"
```

To start tracking files and directories, you can use `dvc add` just like `git add`. Below, we are adding the entire `data` folder to DVC because it contains tens of thousands of images, which would bloat the repository and slow everything down if we tried to track them with `git`:

```bash
$ dvc add data
```

When the `add` command is run, here is what happens under the hood:

1. The `data` directory is put under DVC's control.
2. The `data` directory is added to the `.gitignore` file so it will never be tracked by `git`.
3. A lightweight `data.dvc` file is created, which serves as a placeholder for the original `data` directory.

These lightweight `.dvc` (dot-dvc) files are always tracked with Git. When a user clones our Git repository, the `.dvc` files tell DVC which files to fetch from the remote storage.

> Remember that adding files or folders on a new line inside a `.gitignore` file will make them invisible to `git` commands.

Now, since the large `data` directory is added to `.gitignore`, we can safely stage all the other files with `git` and commit them:

```
$ git add --all
$ git commit -m "Initialize DVC and add the raw images to DVC"
```

So, here is the summary of how to use Git and DVC in tandem:

1. Whenever you make changes to code or other lightweight files, track the changes with `git add filename` or `git add --all`.
2. Whenever there is a change to large files tracked with `dvc`, track it by running `dvc add file/or/dir`, which updates the corresponding `.dvc` file. Then, stage the updated `.dvc` file with `git add` and commit it.
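The second rule can be sketched as a tiny Python wrapper around the same three commands used throughout this tutorial. The helper names are hypothetical, and by default the script only prints the commands (a dry run), so nothing is executed unless you ask for it.

```python
import shlex
import subprocess


def tracking_commands(data_path, message):
    """Commands that record a change to DVC-tracked data, in order."""
    return [
        ["dvc", "add", data_path],         # refresh the cache and the .dvc file
        ["git", "add", "--all"],           # stage the updated .dvc file
        ["git", "commit", "-m", message],  # snapshot it in the Git history
    ]


def track(data_path, message, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for command in tracking_commands(data_path, message):
        print("$", shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Example usage: track("data", "Save resized images", dry_run=False)
```

Keeping the command list separate from the code that runs it makes the order easy to inspect before anything touches the repository.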

For example, running `python src/preprocess.py` changes the DVC-tracked `data` directory: it resizes and rescales all the images inside `data/raw/train` and saves them to `data/prepared/train`:

```python
from pathlib import Path

from joblib import Parallel, delayed
from skimage.io import imread, imsave
from skimage.transform import resize
from skimage.util import img_as_ubyte
from tqdm import tqdm

DATA_DIR = Path("data")
train_dir = DATA_DIR / "raw" / "train"


def resize_image(image_path, target_size):
    """
    Resize an image to target_size and save it under data/prepared.
    """
    # Read the image and rescale pixel values to the [0, 1] range
    image = imread(image_path) / 255.0
    # Resize the image to the target_size
    image = resize(image, target_size, anti_aliasing=True)

    # Create a new path to the image in the prepared directory
    target_path = str(image_path).replace("raw", "prepared")
    Path(target_path).parent.mkdir(parents=True, exist_ok=True)

    # Convert back to 8-bit integers so the PNG writer accepts the array
    imsave(target_path, img_as_ubyte(image))


if __name__ == "__main__":
    image_paths = []

    # Collect all image paths from `data/raw/train`
    for directory in train_dir.iterdir():
        image_paths.extend(list(directory.glob("*.png")))

    # Resize the images in parallel across 10 worker processes
    Parallel(n_jobs=10, backend="multiprocessing")(
        delayed(resize_image)(path, (150, 150)) for path in tqdm(image_paths)
    )

```

The `resize_image` function takes an image path and reads the image as a NumPy array using the `imread` function. The image is resized to `target_size` and saved to a new path inside the `prepared` directory.

In the `__main__` context, we are collecting all image paths and using parallel execution to resize and save multiple images simultaneously.
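One detail worth a second look is how the script derives the output path: `str.replace("raw", "prepared")` rewrites every occurrence of `raw` in the path, including one that might appear in a file name. A slightly more defensive variant, sketched below with a hypothetical `prepared_path` helper and the directory names used in this project as defaults, swaps only the root folder.

```python
from pathlib import Path


def prepared_path(image_path, raw_root="data/raw", prepared_root="data/prepared"):
    """Map data/raw/<...>/img.png to data/prepared/<...>/img.png."""
    # relative_to raises ValueError if image_path is not inside raw_root
    relative = Path(image_path).relative_to(raw_root)
    return Path(prepared_root) / relative


# Example usage:
# prepared_path("data/raw/train/0/raw_sign.png")
# keeps the file name intact, while str.replace would also rename the file.
```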

Once the script finishes, you can see if there were changes to any DVC-tracked files with `dvc status`. You should see an output similar to below:

```bash
$ dvc status
data.dvc:
    changed outs:
          modified:        data/
```

So, we track the new changes with `dvc add`, stage the updated `data.dvc` file with `git add --all` and commit:

```bash
$ dvc add data
$ git add --all
$ git commit -m "Save resized images"
```

#### Uploading files

Now, let's push all the commits made with `git` along with the DVC changes. First, we run `git push`, followed by `dvc push`.

`git push` will upload the code and `.dvc` files to GitHub, while `dvc push` sends the original and resized images to the `remote`, which is the `~/dvc_remote` directory on your machine.

```bash
$ git push
$ dvc push
```

Once the large files are stored in the remote, you can delete them:

```bash
$ rm -rf data/raw/train
```

If you want to redownload those files, you can simply call `dvc pull`:

```
$ dvc pull
```

`dvc pull` detects the differences between the working directory and the remote storage and downloads whatever is missing.

When a new user clones your Git repository, they will also use the `dvc pull` command to populate the working directory with the files stored in your remote.

### Building an image classification model

Now, it is time to build a baseline model and track it with DVC. In `src/train.py`, we have the following script that trains a baseline CNN using the `ImageDataGenerator` class. Since the focus of the article isn't on TensorFlow, you can learn [how `ImageDataGenerator` works](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) from the docs.

```python
from pathlib import Path

import tensorflow as tf
from joblib import dump

# Set the paths to the train and validation directories
base_dir = Path(__file__).parent.parent
data_dir = base_dir / "data"

# Create an ImageDataGenerator object for the train set
data_gen = tf.keras.preprocessing.image.ImageDataGenerator(...)

# Generate training data from the train directory
train_generator = data_gen.flow_from_directory(
    data_dir / "raw" / "train",  # Target directory
    target_size=(50, 50),  # Resize images to 50x50
    ...
)


def get_model():
    """Define the model to be fit"""
    # Define a CNN model
    model = tf.keras.models.Sequential(...)

    # Compile the model
    model.compile(...)

    return model


def main():
    # Get the model
    model = get_model()

    # Fit the model
    history = model.fit(
        train_generator,  # Use the train generator
        steps_per_epoch=100,
        epochs=10,  # Train for 10 epochs
    )

    metrics_dir = base_dir / "metrics"
    models_dir = base_dir / "models"
    metrics_dir.mkdir(exist_ok=True)
    models_dir.mkdir(exist_ok=True)

    dump(history.history, metrics_dir / "history.joblib")
    dump(model, models_dir / "model.joblib")


if __name__ == "__main__":
    main()

```

> You can find the full script in the project repository [here](https://github.com/BexTuychiev/traffic_signs_recognition/blob/main/src/train.py).

The important part of the script is the `main` function. Inside, we are fitting the model and saving it and its metrics to the newly created `models` and `metrics` directories using `joblib.dump`.

We run the script:

```
$ python src/train.py
```
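Once training finishes, the saved history is handy for writing commit messages and tags. A Keras history is a dictionary that maps each metric name to a list with one value per epoch, so the final scores are simply the last element of each list. The helper names below are made up for this sketch, and the default path assumes the `metrics/history.joblib` file written by the script above.

```python
def final_epoch_metrics(history):
    """Return the last recorded value of every metric in a Keras-style history dict."""
    return {name: values[-1] for name, values in history.items() if values}


def load_final_metrics(path="metrics/history.joblib"):
    """Load the saved history and summarize it (requires joblib to be installed)."""
    from joblib import load  # imported lazily so the helper above has no dependencies

    return final_epoch_metrics(load(path))


# Example usage from the project root, after running src/train.py:
# print(load_final_metrics())
```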

Once finished, we add the `models` directory to DVC:

```
$ dvc add models
$ git add --all
$ git commit -m "Baseline model with 0.2192 accuracy"
```

Here, `git add --all` stages the new `models.dvc` file and the `history.joblib` file inside `metrics`. It is also a good practice to tag each experiment with `git`:

```bash
$ git tag -a baseline -m "Baseline model with 0.2192 accuracy"
```

Finally, we push the commits, DVC changes and tags with:

```
$ dvc push
$ git push
$ git push origin --tags
```

Now, if we want to improve the accuracy score by trying different CNN architectures, we modify the `train.py` script, run it and track the new `model.joblib` and `history.joblib` files. We also create a commit and a tag that summarize the model performance. In the end, we push the changes and tags with both Git and DVC.

Even though this machine learning experimentation workflow is simple and effective, in the next part of the article we will see a much better way of tracking our experiments. Using DVC pipelines and VSCode DVC extension, we will be able to visualize our metrics and model runs right inside an IDE.

### DVC internals

Now that you know how to track and upload files to DVC remote, it is time to take a deeper look at DVC internals. 

We've discussed DVC remote, which is similar to GitHub, where you store the latest official version of your data and models uploaded with `dvc push`.

But, just like Git records commits in your local repository before you push them to GitHub, DVC first stores files in a local area called the cache.

When `dvc init` is called, a `cache` directory is added to the `.dvc` folder. Every time you call `dvc add` or `dvc commit`, the files are copied to the cache.
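A cache like this is usually content-addressed: a file is stored under a name derived from the hash of its contents. The sketch below imitates that idea with a hypothetical `add_to_cache` helper. The two-character subdirectory is modeled loosely on DVC's cache, but the exact layout differs between DVC versions, and DVC may link files instead of copying them, so treat this as an illustration only.

```python
import hashlib
import shutil
from pathlib import Path


def add_to_cache(file_path, cache_dir):
    """Copy a file into a content-addressed cache and return its cache path."""
    file_path, cache_dir = Path(file_path), Path(cache_dir)
    md5 = hashlib.md5(file_path.read_bytes()).hexdigest()
    # The first two hex characters become a subdirectory, the rest the file name
    cached = cache_dir / md5[:2] / md5[2:]
    if not cached.exists():  # identical content is stored only once
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(file_path, cached)
    return cached
```

Because the location depends only on the content, two identical files end up as a single cache entry, and a changed file gets a new entry without overwriting the old version.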

And now, you are asking - doesn't that duplicate the files and waste space? Yes! But just like you can configure the location of the remote storage, you can configure the cache.

In large-scale projects, many professionals share a single powerful machine instead of using laptops or PCs. Therefore, it doesn't make sense for every team member to have a cache in their own working directory. A solution is to point the cache to a shared location.

If you've been following along, our project's cache is under `.dvc/cache`. But, we can point to another directory with the following commands:

```
$ dvc cache dir path/to/shared_cache
$ mv .dvc/cache/* path/to/shared_cache
```

Just make sure that all team members have read/write permissions to the `path/to/shared_cache` when sharing a single development machine.
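Each team member can verify this from their own account with a few lines of Python. The helper name is made up, and the check only covers the directory itself, not every file inside it.

```python
import os


def can_share_cache(cache_dir):
    """True if the current user can read, write and traverse the cache directory."""
    return os.path.isdir(cache_dir) and os.access(
        cache_dir, os.R_OK | os.W_OK | os.X_OK
    )


# Example usage: print(can_share_cache("path/to/shared_cache"))
```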

### Conclusion

Here is a summary of working with DVC:

- A DVC project is initialized on top of a Git repo with `dvc init`
- You should set up a remote for the project with `dvc remote add -d remote_name path/to/remote`
- To start tracking files, use `dvc add`
- `dvc add` copies the specified directory or files to `.dvc/cache` or the shared cache you specified, creates a `.dvc` file for each tracked folder or file and adds the tracked paths to `.gitignore`
- `.dvc` and other files are tracked with `git add --all`
- To push commits and DVC-tracked file changes, use both `git push` and `dvc push`
- `dvc push` uploads the files from the cache to the remote storage
- Label each ML experiment run with a tag and repeat the `dvc add` (or `dvc commit`), `dvc push`, `git add` and `git push` steps for each changed file.

This step-by-step tutorial is already enough to solve most of your problems in data science projects in terms of collaboration and reproducibility. In the next part of the article, we will talk more about simplifying machine learning experimentation with DVC (yes, it can be made even easier)!

Thank you for reading!

