# dvc

Some local experiments to verify the proposal in the DVC Integration PDD will work.

## Using DVC locally

`ResidentMario/dvc-exploration` (i.e. this repo) is a GitHub repo containing a `data/` directory at its root with the files `train.csv.dvc`, `test.csv.dvc`, and `train_head.csv.dvc`.

These files were generated by downloading the MNIST dataset from the [Digit Recognizer](https://www.kaggle.com/c/digit-recognizer/data?select=test.csv) competition on Kaggle, unzipping the `train.csv` and `test.csv` files into this directory, and then running the following:

```
dvc init
dvc add data/train.csv
dvc add data/test.csv
cd data
dvc run -d train.csv -o train_head.csv "head train.csv > train_head.csv"
```

Commands two and three (the `dvc add` calls) each:
1. Make a content-addressable copy of the file in `.dvc/cache`.
2. Create a `FILENAME.EXTENSION.dvc` file locally, e.g., `train.csv.dvc`. This is DVC metadata, the most important field of which is the `md5` hash.
3. Create or append to a `.gitignore` in the directory telling `git` to ignore the data files.

The contents of `train.csv.dvc` as an example:

```yaml
md5: f1fe550396d7b876a101ff449eced674
outs:
- md5: f3eaeafb90cde88b238ebc8dfd4501c5
  path: train.csv
  cache: true
  metric: false
  persist: false
```

The last command:
1. Creates a new pipeline definition metadata file, `train_head.csv.dvc`. This contains `md5` hashes for both the input and output files.

The contents of `train_head.csv.dvc`:

```yaml
md5: 0e1ce87d933511bce02403cd3d0ac8a8
cmd: head train.csv > train_head.csv
deps:
- md5: f3eaeafb90cde88b238ebc8dfd4501c5
  path: train.csv
outs:
- md5: e502cb16c02f3572bb68adadcfc50463
  path: train_head.csv
  cache: true
  metric: false
  persist: false
```
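One plausible use of the recorded `deps` hashes is deciding whether the stage needs to be re-run. The sketch below is only an illustration of that idea, not DVC's actual logic: `changed_deps` is a hypothetical helper, and the stage is written out by hand as a dict rather than parsed from the file.

```python
def changed_deps(stage, current_hashes):
    """Return paths of dependencies whose current hash differs from the recorded one."""
    return [
        dep["path"]
        for dep in stage.get("deps", [])
        if current_hashes.get(dep["path"]) != dep["md5"]
    ]


# train_head.csv.dvc from above, written out as a dict (some fields trimmed).
stage = {
    "cmd": "head train.csv > train_head.csv",
    "deps": [{"md5": "f3eaeafb90cde88b238ebc8dfd4501c5", "path": "train.csv"}],
    "outs": [{"md5": "e502cb16c02f3572bb68adadcfc50463", "path": "train_head.csv"}],
}

# Unchanged input: nothing to re-run.
print(changed_deps(stage, {"train.csv": "f3eaeafb90cde88b238ebc8dfd4501c5"}))  # []
# Input hash differs from the recorded one: the stage is stale.
print(changed_deps(stage, {"train.csv": "0" * 32}))  # ['train.csv']
```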

The content cache is in `FIRST_TWO_HASH_VALUES/REMAINING_VALUES` format:

```
$ ls .dvc/cache/
a3 e5 f3
$ ls .dvc/cache/a3
5759d77c0a3dadb4d4253ff87ec430
```
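The layout above can be sketched in a few lines of Python. This is a minimal illustration, assuming the hash is a plain MD5 over the file's bytes (DVC may treat some files differently, e.g. by normalizing line endings); `file_md5` and `cache_path` are hypothetical helper names, not DVC functions.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Map a hash to its cache entry: first two characters, then the rest."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]


# The `outs` hash recorded in train.csv.dvc above.
print(cache_path("f3eaeafb90cde88b238ebc8dfd4501c5").as_posix())
# .dvc/cache/f3/eaeafb90cde88b238ebc8dfd4501c5
```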

After running these commands we have a repository with code managed by Git and data managed by DVC. So far so good, but it's still not reproducible (or very useful) because the repository is non-portable. The `.dvc/cache` directory is added to `.gitignore` at `dvc init` time; it's just a convenience for reusing data files already on local disk.

## Using DVC with a remote

To get utility out of DVC you need to configure a remote. This will allow you to `dvc push` to and `dvc pull` from the remote to get all of the data artifacts you need. DVC supports all of S3, GCS, and Azure Blob Storage.

Run:

```
$ dvc remote add -d spell-share s3://spell-share/aleksey/dvc-exploration
```

This will configure this bucket and path as the default remote for this project. This is controlled by `.dvc/config`, a (`git` version controlled) file which is empty at `dvc init` time but whose contents will now be:

```ini
[core]
    remote = spell-share
['remote "spell-share"']
    url = s3://spell-share/aleksey/dvc-exploration
```
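To make the relationship between `core.remote` and the remote's `url` concrete, here is a rough sketch that resolves the default remote from the config text shown above. It uses Python's standard `configparser`, which is an assumption made for illustration (DVC may well parse this file differently), and `default_remote_url` is a hypothetical helper.

```python
import configparser

CONFIG_TEXT = """\
[core]
    remote = spell-share
['remote "spell-share"']
    url = s3://spell-share/aleksey/dvc-exploration
"""


def default_remote_url(config_text):
    """Look up the URL of the remote named by `core.remote`."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    for section in parser.sections():
        # configparser keeps the single quotes from the header, so strip them.
        if section.strip("'") == f'remote "{name}"':
            return parser[section]["url"]
    raise KeyError(f"no remote named {name!r} in config")


print(default_remote_url(CONFIG_TEXT))
```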

To publish the data to collaborators, run:

```
$ dvc push
```

This flushes the contents of the `.dvc/cache` directory to S3:

```
$ aws s3 ls s3://spell-share/aleksey/dvc-exploration/
                           PRE a3/
                           PRE e5/
                           PRE f3/
```

We can now reproduce the data for the repository from a clean copy of the repository:

```
$ git clone https://github.com/ResidentMario/dvc-exploration.git dvc-exploration-new
$ cd dvc-exploration-new
$ dvc pull
```

AFAICT this will find all of the `*.dvc` files rooted in the current directory (recursive search?), grab their `md5` fields, and use `boto3` to pull the matching objects from the S3 CAS created in the `dvc push` step.
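Under that reading, the planning half of a pull could look roughly like the sketch below. Everything here is an assumption for illustration: the helper names are hypothetical, the `.dvc` parser is a naive line scanner rather than a YAML parser, the object key layout is inferred from the `a3/`, `e5/`, `f3/` prefixes in the bucket listing, and the actual download (e.g. via `boto3`) is left out.

```python
from pathlib import Path
from urllib.parse import urlparse


def find_dvc_files(root="."):
    """Recursively list *.dvc files (skipping the .dvc directory itself)."""
    return sorted(p for p in Path(root).rglob("*.dvc") if p.is_file())


def out_hashes(dvc_text):
    """Collect `md5` values listed under `outs:` in a .dvc file (naive line parser)."""
    hashes, in_outs = [], False
    for line in dvc_text.splitlines():
        if line and not line[0].isspace() and not line.startswith("-"):
            # A new top-level key starts; remember whether it is `outs`.
            in_outs = line.startswith("outs:")
        elif in_outs and line.lstrip("- ").startswith("md5:"):
            hashes.append(line.split("md5:", 1)[1].strip())
    return hashes


def remote_location(remote_url, md5_hex):
    """Split an s3:// remote URL into (bucket, key) for one cached object."""
    parsed = urlparse(remote_url)
    prefix = parsed.path.strip("/")
    return parsed.netloc, f"{prefix}/{md5_hex[:2]}/{md5_hex[2:]}"


if __name__ == "__main__":
    remote = "s3://spell-share/aleksey/dvc-exploration"
    for dvc_file in find_dvc_files("."):
        for md5_hex in out_hashes(dvc_file.read_text()):
            print(dvc_file, *remote_location(remote, md5_hex))
```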

This unifies all of your team's data fetch operations across all of your projects under a common API, which is very nice indeed!

## What if you already have it in your local cache though

Key observation: **DVC will skip pulling data from S3 if the file is already in the local cache**.

To verify this, try the following:

```
$ cd ..; rm -rf dvc-exploration-new
$ git clone https://github.com/ResidentMario/dvc-exploration.git dvc-exploration-new
$ cd dvc-exploration-new
$ cp -rf ../dvc-exploration/.dvc/cache .dvc/cache
```

This will make a copy of a complete local cache in the new repo without one.
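The skip behavior amounts to a filter over the hashes a pull would otherwise fetch. The sketch below shows that filter under the cache layout described earlier; `hashes_to_fetch` is a hypothetical helper, not part of DVC, and it checks only that a cache entry exists, not that its contents are intact.

```python
from pathlib import Path


def hashes_to_fetch(md5_hexes, cache_dir=".dvc/cache"):
    """Return only the hashes with no matching entry in the local cache."""
    cache = Path(cache_dir)
    return [h for h in md5_hexes if not (cache / h[:2] / h[2:]).is_file()]


if __name__ == "__main__":
    wanted = [
        "f3eaeafb90cde88b238ebc8dfd4501c5",  # train.csv
        "e502cb16c02f3572bb68adadcfc50463",  # train_head.csv
    ]
    # With a complete local cache this list is empty, so nothing is downloaded.
    print(hashes_to_fetch(wanted))
```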

In [7]:
!ls -lah ../data/

total 72
drwxr-xr-x  8 alekseybilogur  staff   256B May 18 19:45 .
drwxr-xr-x  9 alekseybilogur  staff   288B May 18 19:34 ..
-rw-r--r--  1 alekseybilogur  staff    37B May 18 19:33 .gitignore
lrwxr-xr-x  1 alekseybilogur  staff    53B May 18 19:45 test.csv -> /Users/alekseybilogur/Desktop/dvc-data-files/test.csv
-rw-r--r--  1 alekseybilogur  staff   148B May 18 19:20 test.csv.dvc
lrwxr-xr-x  1 alekseybilogur  staff    54B May 18 19:24 train.csv -> /Users/alekseybilogur/Desktop/dvc-data-files/train.csv
-rw-r--r--  1 alekseybilogur  staff   149B May 18 19:20 train.csv.dvc
-rw-r--r--  1 alekseybilogur  staff    23K May 18 19:33 train_head.csv


The DVC run command succeeded with the expected result. The symlinks are still in place.

In [9]:
!ls -lah ../.dvc/cache

total 0
drwxr-xr-x   5 alekseybilogur  staff   160B May 18 19:33 .
drwxr-xr-x  10 alekseybilogur  staff   320B May 18 19:33 ..
drwxr-xr-x   3 alekseybilogur  staff    96B May 18 19:20 a3
drwxr-xr-x   3 alekseybilogur  staff    96B May 18 19:33 e5
drwxr-xr-x   3 alekseybilogur  staff    96B May 18 19:20 f3


Of course the cache entry is still populated:

In [13]:
# !head ../.dvc/cache/a3/5759d77c0a3dadb4d4253ff87ec430

This experiment shows that DVC can target symlinked files that are not under its control.

Recall that the cache is where DVC stores the copy of the dataset that it controls. Cache entries are files of the `.dvc/cache/TWO/MANY` format, where `TWO` is the first two characters of a content hash and `MANY` is the remainder. Files are copied (moved? the docs say "moved" and I should verify what the behavior here is exactly) into this directory at `dvc add` time. DVC then creates a `.gitignore` file in the target file's directory, placing the dataset out of `git` version control.

From now on, DVC commands targeting the original path for the dataset will transparently go to the cache copy of the data instead. The original data file will not appear in the Git repository.

This seems...kind of useless, honestly? It doesn't help me in any way because other users don't have access to my local filesystem, obviously.

The more interesting part of DVC is the workflow through a remote. DVC has remote setup, push, and pull semantics targeting object storage.

I configured `s3://spell-share/aleksey/dvc-exploration` as my default remote via `dvc remote add`. I then used `dvc push` to send the files.

Then after a `git clone https://github.com/ResidentMario/dvc-exploration.git dvc-exploration-new` and a `cd dvc-exploration-new` and a `dvc pull` I get the data files down from S3 safe and sound. Peachy; this is the core value prop of DVC.

## Experiment 2

Let's try modifying the repository that gets pulled down, overwriting the cache entry pointing into S3 with an entry pointing to a path on the local machine instead. `dvc pull` should then target this instead of S3.

In [24]:
!dvc list ../

data

In [18]:
!cat ../data/.gitignore

/train.csv
/test.csv
/train_head.csv
