# A Complete Tutorial to Data Version Control With DVC in Python
## Version gigabyte-sized datasets just like code

### The big problem in data science

- The problem in data science:
    - Engineers run many experiments that involve both data and models, which are often very large.
    - Git can't handle large files well.
    - Alternatives like Git LFS are a hassle to set up, slow down Git repositories, and increase their size.
    - There are no easy options for creating branches and commits or for experimenting safely.
    - The result: no easy way to reproduce experiments and share code together with data and models.

### What is data version control and DVC?

- Tools like Git have matured over many years in traditional software engineering settings.
- In a traditional project, team members collaborate by making copies of the central code repository and requesting that their changes be integrated into the original. Those changes are then reviewed and tested.
- This cycle can repeat many times in a single day, but the problems described above make it hard to apply to data science projects.
- That's why many engineers are building tools that bring these software engineering best practices to data science projects.
- One such tool is Data Version Control (DVC), which is used alongside Git.
- It doesn't reinvent the wheel; the two work in tandem. Git manages lightweight code files while DVC handles heavyweight data and models.
- DVC stores large files in a remote storage. The remote can be a directory anywhere on your machine or a cloud provider such as AWS, GCP or Azure.
- When a file is added to DVC, a lightweight `.dvc` file is created, which in turn is versioned by Git.
- When you host your Git repo on GitHub, only code files and `.dvc` files are stored there, while models and data live on the remote. When someone clones your GitHub repo, they can use the information inside the `.dvc` files to restore the models and data with a single command.

### What will you learn in the tutorial?

- How to version large files with DVC alongside Git
- How to set up a local remote storage
- The basics of the DVC workflow

### Setting up the project and environment

Let's get started by creating the `conda` environment we will be working in.

```
conda create -n traffic_signs_recognition python=3.9 -y

conda activate traffic_signs_recognition
```

Next, we clone the following repo and change into the working directory. Alternatively, you can create the working directory yourself and initialize `git` with `git init`.

```
git clone https://github.com/BexTuychiev/traffic_signs_recognition.git
cd traffic_signs_recognition
```

TODO - create a base repo on GitHub to clone with DVC.

Let's first create the `requirements.txt` file with a few dependencies and install them. Run the following commands:

```
$ echo -e "tensorflow\nscikit-learn\nnumpy\npandas\nmatplotlib\nseaborn\nscikit-image\ndvc" >> requirements.txt
$ cat requirements.txt
tensorflow
scikit-learn
numpy
pandas
matplotlib
seaborn
scikit-image
dvc
$ pip install -r requirements.txt
```

> Running the `echo` command with the `-e` flag makes it interpret escape sequences like line breaks (`\n`).

We installed a few standard data libraries along with `scikit-image` for image manipulation and `tensorflow` for building the models. The last one is `dvc`, which is the main focus of the article.

Now, let's build the tree structure of our project:

```bash
$ mkdir data notebooks src data/raw data/prepared data/prepared/train
```

We will store the scripts inside `src`, while `data` and `notebooks` will hold the images and analysis notebooks we might create later.

### Download and set up the data

Now, we will download the dataset for the project. The GTSRB (German Traffic Sign Recognition Benchmark) dataset contains more than 50k images divided into 43 road sign categories. Our task is to build a convolutional neural network that can accurately classify each image into its category.

You can go to the [dataset page](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign) or download it directly using [this link](https://storage.googleapis.com/kaggle-data-sets/82373/191501/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20221210%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221210T130850Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=65eeae3c577195c0b9185b9e37ab185a3e5cc8c990a501390621201196cfd2e5ecbb0952db6bc443a09d08e252744472705c7bc90caa2c82aaa699b7d24f5592075046a771f05e424bb0d7fc6e8f8bff4e04e25a5e4e2b2e816a966e25df023050344400b97e676d9d0ac0c93c9046a007d74db740d311822fd79ea6bbdfa4d6459de2b2b061ca5187d2bf83c284feef39b06296cf4f46c7bc6f95c6488d7ea78a4eaf28ea43e7f8ef0afd97805d0943782b99377fd35a9e8781f17419d2fff43d66822d56c11802f209822dd86ba4e64edd7800d3125a7cff88b5616fbd3ddc0f2f3dfea2f86325cd185fc88cb5e46d517a846d407d4b6637df713cd8a36c36) and the below commands:

```
$ curl "the_link_inside_quotes" -o data/traffic_signs.zip
```

Once the download is done, unzip the images into the `data/raw` directory. Then, we can remove the unnecessary files and directories like duplicates of the images and metadata. This will leave us only with the `train` and `test` folders inside `data/raw`.

```bash
$ unzip data/traffic_signs.zip -d data/raw
$ cd data/raw
$ rm -rf Train Test Meta meta Meta.csv Test.csv Train.csv
$ rm test/GT-final_test.csv
$ cd ../..
$ rm data/traffic_signs.zip
```

The last command removes the downloaded zip archive as well.

The `train` folder has 43 folders, one for each class. Keep this directory structure in mind, as we will use it when training a model.
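As a quick sanity check on that structure, a minimal sketch like the one below could count the class folders and the files in each. The helper name `count_images_per_class` is made up for illustration, and the path assumes the `data/raw/train` layout described above.

```python
from pathlib import Path


def count_images_per_class(train_dir):
    """Map each class folder name to the number of files it contains."""
    counts = {}
    for class_dir in sorted(Path(train_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts


# Assumed location of the training images; skipped if the folder is absent.
train_dir = Path("data/raw/train")
if train_dir.is_dir():
    counts = count_images_per_class(train_dir)
    print(f"{len(counts)} classes, {sum(counts.values())} images")
```

If the download and cleanup went as described, the printed class count should match the number of folders in `train`.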

### Initializing DVC

Having a dataset is already enough to start working with `dvc`. In this section, you will see the basics of how Git and DVC work together. 

To add DVC tracking to our project, we only need to call `dvc init`, much like `git init`. DVC is designed to work on top of a Git repository. The `init` command adds a special `.dvc` directory that holds the DVC configuration.

```
$ dvc init
$ git status -s
A  .dvc/.gitignore
A  .dvc/config
A  .dvcignore
```

The command also creates a `.dvcignore` file, which can list files and directories that DVC should ignore. For now, we will leave it empty.

Once DVC is initialized, it needs a place, called remote storage, to which it uploads data and other large files so that they aren't tracked by Git. A DVC remote can be a cloud storage provider like AWS, Azure or GCP, or just another directory on your machine.

For simplicity, we will set the remote storage for this project to a new directory called `dvc_remote` in the home directory.

```
$ mkdir ~/dvc_remote
$ dvc remote add -d remote ~/dvc_remote
```

The `remote` command is used to work with remote storages. Here, we are naming our remote storage simply `remote`. The `-d` flag tells DVC that `dvc_remote` is the default remote storage.

Once you run these commands, you can look at the `config` file inside the `.dvc` folder:

```bash
$ cat .dvc/config
[core]
    remote = remote
['remote "remote"']
    url = /home/bexgboost/dvc_remote/
```

As you can see, the remote name is listed as `remote` and the `url` is set to a path in my own home directory. If our remote was cloud-based, it would be a web URL.
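To inspect the same settings from Python, one option is to parse the file with the standard library. This is only a sketch: DVC reads its config with its own machinery, and the assumption here is that `configparser` happens to cope with the sample shown above, including the quoted section name. The function name `default_remote_url` is hypothetical.

```python
import configparser

# Copy of the config shown above; the path is the author's home directory.
SAMPLE_CONFIG = """\
[core]
    remote = remote
['remote "remote"']
    url = /home/bexgboost/dvc_remote/
"""


def default_remote_url(config_text):
    """Return (name, url) of the default remote found in the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # The section header keeps its quotes literally: 'remote "<name>"'
    section = f"'remote \"{name}\"'"
    return name, parser[section]["url"]


name, url = default_remote_url(SAMPLE_CONFIG)
print(name, url)
```

For real projects, `dvc remote list` reports the same information without any parsing.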

#### Adding files to track with DVC

- Add files with the `dvc add` command
- When `add` is called, files with the `.dvc` extension are created
- These `.dvc` files should be tracked with Git
- `add` also lists the added directories in `.gitignore`
- Write a script to preprocess the images (resize or scale)

First, commit the files created by `dvc init`:

```bash
$ git add --all
$ git commit -m "Initialize DVC"
```

To start tracking files and directories, use `dvc add`, much like `git add`. Below, we add the entire `data` folder to DVC because it contains tens of thousands of images, which would bloat the repository and slow it down badly if we tracked them with Git:

```bash
$ dvc add data
```

When the `add` command is run, here is what happens under the hood:

1. The `data` directory is put under DVC's control.
2. The `data` directory is added to the `.gitignore` file so that it is never tracked by Git.
3. A lightweight `data.dvc` file is created, which serves as a placeholder for the original `data` directory.

These lightweight `.dvc` (dot-dvc) files are always tracked with Git. When a user clones our Git repository, `.dvc` files will contain instructions of where the original large files are stored.
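To make the three steps more concrete, here is a toy sketch of the idea for a single file. It is not DVC's implementation: the function name `toy_add`, the cache layout and the pointer file contents are simplified assumptions that only loosely mirror what DVC writes.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(path, cache_dir):
    """Toy version of the three steps above for one file (not real DVC)."""
    path = Path(path)
    md5 = file_md5(path)

    # 1. Put the content under the tool's control: copy it into a cache
    #    addressed by hash (assumed layout: <cache>/<2 chars>/<remaining chars>).
    target = Path(cache_dir) / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copy2(path, target)

    # 2. Keep Git away from the large file.
    with open(path.parent / ".gitignore", "a") as f:
        f.write(f"/{path.name}\n")

    # 3. Write a small pointer file that Git can track instead.
    pointer = path.with_name(path.name + ".dvc")
    pointer.write_text(
        f"outs:\n- md5: {md5}\n  size: {path.stat().st_size}\n  path: {path.name}\n"
    )
    return pointer
```

Because the cache is keyed by content hash, adding the same file twice stores it only once, which is the property that makes versioning large data affordable.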

Now that the large `data` directory is listed in `.gitignore`, we can safely stage everything else, including the new `data.dvc` file, and commit it with Git:

```
$ git add --all
$ git commit -m "Add the raw images to DVC"
```

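Next, run the preprocessing script (`preprocess.py`) to resize the images. Because the contents of `data` have changed, add the directory to DVC again and commit the updated `data.dvc` file: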

```bash
$ dvc add data
$ git add --all
$ git commit -m "Save resized images"
```
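The preprocessing script itself is not shown in this draft. One possible sketch, using `scikit-image` from the requirements, is given below. The target size of 32x32, the function names and the choice to mirror `data/raw/train` under `data/prepared/train` are all assumptions made for illustration.

```python
from pathlib import Path

RAW_DIR = Path("data/raw/train")            # layout described earlier
PREPARED_DIR = Path("data/prepared/train")  # created with mkdir earlier
TARGET_SIZE = (32, 32)                      # assumed size, not from the article


def prepared_path(raw_path, raw_root=RAW_DIR, prepared_root=PREPARED_DIR):
    """Mirror a raw image path under the prepared directory."""
    return Path(prepared_root) / Path(raw_path).relative_to(raw_root)


def resize_all(raw_root=RAW_DIR, prepared_root=PREPARED_DIR, size=TARGET_SIZE):
    """Resize every PNG under raw_root and save it under prepared_root."""
    # Imported lazily so the path helper works without scikit-image installed.
    from skimage import io, transform, util

    for raw_path in sorted(Path(raw_root).rglob("*.png")):
        out_path = prepared_path(raw_path, raw_root, prepared_root)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        image = io.imread(str(raw_path))
        # resize returns floats in [0, 1]; channels are preserved.
        resized = transform.resize(image, size, anti_aliasing=True)
        io.imsave(str(out_path), util.img_as_ubyte(resized), check_contrast=False)
```

Calling `resize_all()` from the project root would fill `data/prepared/train` with one subfolder per class, keeping the same structure as the raw images.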

#### Uploading files

- Push code changes to the Git remote with `git push`
- Push data changes to the DVC remote with `dvc push`
- Remove and re-download files with `dvc pull`

```bash
$ git push
$ dvc push
```
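Conceptually, pushing and pulling boil down to copying whichever stored objects the other side lacks. The toy sketch below illustrates that with two plain directories. The names `sync_objects`, `toy_push` and `toy_pull` are hypothetical, and real DVC does more, such as restoring files into the workspace on pull.

```python
import shutil
from pathlib import Path


def sync_objects(src_store, dst_store):
    """Copy every file missing from dst_store; return the relative paths copied."""
    src_store, dst_store = Path(src_store), Path(dst_store)
    copied = []
    for src in sorted(p for p in src_store.rglob("*") if p.is_file()):
        rel = src.relative_to(src_store)
        dst = dst_store / rel
        if not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied


def toy_push(cache_dir, remote_dir):
    """Upload objects the remote does not have yet."""
    return sync_objects(cache_dir, remote_dir)


def toy_pull(cache_dir, remote_dir):
    """Download objects the local cache does not have yet."""
    return sync_objects(remote_dir, cache_dir)
```

Running `toy_push` twice in a row copies nothing the second time, which mirrors why repeated pushes of unchanged data are cheap.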

### Building an image classification model

- Create `train.py` using code similar to that in the other notebook
- Save the model along with its metric
- Add the models directory to DVC

### DVC internals

- Explain what a `.dvc` file contains
- MD5 is a hash (checksum) of 32 hexadecimal characters that changes drastically if even a single byte of the input differs
- Explain the top-level dvc.cache file
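The MD5 behaviour noted in the list can be seen with a few lines of standard-library Python. The byte strings are made-up stand-ins for file contents.

```python
import hashlib

original = b"traffic sign image bytes"
modified = b"traffic sign image bytez"  # only the last byte differs

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(modified).hexdigest()

# Count the positions at which the two hex digests disagree.
differing = sum(a != b for a, b in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex characters differ")
```

Both digests are 32 characters long, and a one-byte change in the input typically alters most of them, which is why the hash works as a fingerprint for detecting changed files.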

```
outs:
- md5: 24a6110d94afa0d2de710ffef5d22f21.dir
  size: 235642986
  nfiles: 39209
  path: raw

```

```
[{"md5": "feed603ff276416ffd8731cd3e26f370", "relpath": "train/0/00000_00000_00000.png"}, 
 {"md5": "e8094e491547578ff49635a7f6280860", "relpath": "train/0/00000_00000_00001.png"}, 
 {"md5": "ac60c11dfed53788d8057b74f8cd8a12", "relpath": "train/0/00000_00000_00002.png"}, 
 {"md5": "7a242b3eb51e618c7e8f95925b09ac47", "relpath": "train/0/00000_00000_00003.png"}
```
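A listing like the one above can be approximated in a few lines: hash every file in a directory, record its relative path, then hash the listing itself to get a single identifier for the whole directory. This is a sketch under assumptions. DVC's exact ordering and serialization may differ, so the resulting hashes are not expected to match real `.dir` values, and the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def dir_manifest(root):
    """List every file under root with its MD5 and path relative to root."""
    root = Path(root)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "md5": hashlib.md5(path.read_bytes()).hexdigest(),
            "relpath": path.relative_to(root).as_posix(),
        })
    return entries


def dir_hash(entries):
    """Hash the manifest itself and mark the result as a directory hash."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"
```

Changing, adding or removing any single image changes the manifest and therefore the directory hash, so one short string in the `.dvc` file is enough to pin the state of tens of thousands of files.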

### Conclusion