# dvc

Some local experiments to verify that the proposal in the DVC Integration PDD will work.

## Using DVC locally

`ResidentMario/dvc-exploration` (i.e. this repo) is a GH repo containing a `data/` root directory with the files `train.csv.dvc`, `test.csv.dvc`, and `train_head.csv.dvc`.

These files were generated by downloading the MNIST dataset from the [Digit Recognizer](https://www.kaggle.com/c/digit-recognizer/data?select=test.csv) competition on Kaggle, unzipping the `train.csv` and `test.csv` files into this directory, and then running the following:

```
dvc init
dvc add data/train.csv
dvc add data/test.csv
cd data
dvc run -d train.csv -o train_head.csv "head train.csv > train_head.csv"
```

Commands two and three (the `dvc add` calls) each:
1. Make a content-addressable copy of the file in `.dvc/cache`.
2. Create a `FILENAME.EXTENSION.dvc` file locally, e.g., `train.csv.dvc`. This is DVC metadata, the most important field of which is the `md5` hash.
3. Create or append to a `.gitignore` in the directory telling `git` to ignore the data files.

The contents of `train.csv.dvc` as an example:

```yaml
md5: f1fe550396d7b876a101ff449eced674
outs:
- md5: f3eaeafb90cde88b238ebc8dfd4501c5
  path: train.csv
  cache: true
  metric: false
  persist: false
```

The last command (`dvc run`) creates a new pipeline definition metadata file, `train_head.csv.dvc`. This contains `md5` hashes for both the input and output files.

The contents of `train_head.csv.dvc`:

```yaml
md5: 0e1ce87d933511bce02403cd3d0ac8a8
cmd: head train.csv > train_head.csv
deps:
- md5: f3eaeafb90cde88b238ebc8dfd4501c5
  path: train.csv
outs:
- md5: e502cb16c02f3572bb68adadcfc50463
  path: train_head.csv
  cache: true
  metric: false
  persist: false
```
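Recording the `deps` hashes is what makes it possible to tell whether an input changed since the stage last ran. A minimal sketch of that check, assuming the stage file has already been parsed into a dict and that the hash is a plain MD5 over the raw file bytes (DVC's own hashing may treat text files differently). `changed_deps` is a hypothetical helper name, not a DVC API.

```python
import hashlib
from pathlib import Path


def changed_deps(stage: dict, base_dir) -> list:
    """Return the paths of dependencies whose current md5 differs from the recorded one."""
    changed = []
    for dep in stage.get("deps", []):
        path = Path(base_dir) / dep["path"]
        # A missing dependency counts as changed.
        current = hashlib.md5(path.read_bytes()).hexdigest() if path.exists() else None
        if current != dep["md5"]:
            changed.append(dep["path"])
    return changed


# Hand-written dict mirroring the `deps` section of `train_head.csv.dvc`.
stage = {
    "cmd": "head train.csv > train_head.csv",
    "deps": [{"md5": "f3eaeafb90cde88b238ebc8dfd4501c5", "path": "train.csv"}],
}
```

Calling `changed_deps(stage, "data")` would return an empty list while `data/train.csv` still hashes to the recorded value, and `["train.csv"]` once it is edited or removed. For files as large as the full training set, hashing in chunks would be kinder to memory than `read_bytes()`.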

The content cache uses a `FIRST_TWO_HASH_CHARACTERS/REMAINING_CHARACTERS` layout:

```
$ ls .dvc/cache/
a3 e5 f3
$ ls .dvc/cache/a3
5759d77c0a3dadb4d4253ff87ec430
```
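The mapping from hash to cache path can be reproduced in a few lines of Python. This is only a sketch: `cache_relpath` and `file_md5` are hypothetical helper names rather than DVC APIs, and hashing the raw bytes is an assumption, so the digest is not guaranteed to match what DVC records for every file.

```python
import hashlib
from pathlib import Path


def cache_relpath(md5_hex: str) -> str:
    """Map an md5 hex digest to its path relative to .dvc/cache."""
    # The first two hex characters name the directory, the rest name the file.
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


def file_md5(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's raw bytes in chunks so large CSVs need not fit in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# The `train.csv` hash recorded in `train.csv.dvc`:
print(cache_relpath("f3eaeafb90cde88b238ebc8dfd4501c5"))
# f3/eaeafb90cde88b238ebc8dfd4501c5
```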

After running these commands we have a repository with code managed by Git and data managed by DVC. So far so good, but it's still not reproducible (or very useful) because the repository is non-portable. The `.dvc/cache` directory is added to `.gitignore` at `dvc init` time; it's just a convenience for reusing data files already on local disk.

## Using DVC with a remote

To get utility out of DVC you need to configure a remote. This will allow you to `dvc push` to and `dvc pull` from the remote to get all of the data artifacts you need. DVC supports all of S3, GCS, and Azure Blob Storage.

Run:

```
$ dvc remote add -d spell-share s3://spell-share/aleksey/dvc-exploration
```

This will configure this bucket and path as the default remote for this project. This is controlled by `.dvc/config`, a (`git` version controlled) file which is empty at `dvc init` time but whose contents will now be:

```ini
[core]
    remote = spell-share
['remote "spell-share"']
    url = s3://spell-share/aleksey/dvc-exploration
```
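Anything that wants to know where the data lives can read this file directly. A minimal sketch using Python's standard `configparser`, assuming the file keeps the shape shown above (a `[core]` section naming the default remote, and a section for that remote whose name may be wrapped in single quotes). `default_remote_url` is a hypothetical helper, and DVC itself may well parse its config differently.

```python
import configparser
from typing import Optional


def default_remote_url(config_text: str) -> Optional[str]:
    """Return the URL of the default remote, or None if no remote is configured."""
    # Disable interpolation so a literal '%' in a URL cannot raise an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser.get("core", "remote", fallback=None)
    if name is None:
        return None
    for section in parser.sections():
        # The section header is read as 'remote "NAME"', quotes included.
        if section.strip("'") == f'remote "{name}"':
            return parser.get(section, "url", fallback=None)
    return None
```

Fed the config text above, this returns `s3://spell-share/aleksey/dvc-exploration`. On the empty config that `dvc init` leaves behind it returns `None`.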

To publish the data to collaborators, you now run:

```
$ dvc push
```

This flushes the contents of the `.dvc/cache` directory to S3:

```
$ aws s3 ls s3://spell-share/aleksey/dvc-exploration/
                           PRE a3/
                           PRE e5/
                           PRE f3/
```

We can now reproduce the data for the repository from a clean copy of the repository:

```
$ git clone https://github.com/ResidentMario/dvc-exploration.git dvc-exploration-new
$ cd dvc-exploration-new
$ dvc pull
```

AFAICT this will find all of the `*.dvc` files rooted in the current directory (recursive search?), grab the `md5` fields of their outputs, and use `boto3` to pull the matching objects from the S3 CAS created in the `dvc push` step.
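To make that guess concrete, here is a rough sketch of the lookup half: walk the repo for `*.dvc` files, collect the `md5` of each entry under `outs:`, and turn each hash into the object key it would have under the remote prefix. Everything here is an assumption about how `dvc pull` behaves, not a description of DVC internals. `out_hashes` and `remote_keys` are hypothetical names, and the line-based parser only handles the flat file layout shown earlier (real code should use a YAML library).

```python
from pathlib import Path


def out_hashes(dvc_text: str) -> list:
    """Collect the md5 of every entry under the top-level `outs:` key."""
    hashes, section = [], None
    for line in dvc_text.splitlines():
        if line and not line.startswith(("-", " ")):
            # Top-level key such as `md5:`, `cmd:`, `deps:`, or `outs:`.
            section = line.split(":", 1)[0]
        elif section == "outs" and line.lstrip("- ").startswith("md5:"):
            hashes.append(line.split(":", 1)[1].strip())
    return hashes


def remote_keys(repo_root, remote_prefix: str) -> list:
    """Return the sorted object keys needed by all *.dvc files under repo_root."""
    keys = set()
    for dvc_file in Path(repo_root).rglob("*.dvc"):
        if not dvc_file.is_file():
            continue  # the `.dvc` directory itself also matches the pattern
        for h in out_hashes(dvc_file.read_text()):
            keys.add(f"{remote_prefix}/{h[:2]}/{h[2:]}")
    return sorted(keys)
```

For this repo, `remote_keys(".", "aleksey/dvc-exploration")` should produce three keys under the `a3/`, `e5/`, and `f3/` prefixes listed by `aws s3 ls`. The download itself is left out; with `boto3` it would presumably be one `download_file` call per key.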

This unifies all of your team's data fetch operations across all of your projects under a common API, which is very nice indeed!

## What if you already have it in your local cache?

Key observation: **DVC will skip pulling data from S3 if the file is already in the local cache**.

To verify this, try the following:

```
$ cd ..; rm -rf dvc-exploration-new
$ git clone https://github.com/ResidentMario/dvc-exploration.git dvc-exploration-new
$ cd dvc-exploration-new
$ cp -rf ../dvc-exploration/.dvc/cache .dvc/
```

This copies the complete local cache from the original repo into the fresh clone, which does not have one yet.

Then run the pull again:

```
$ dvc pull
```

We skip all downloads.

```
3 added
Everything is up to date.
```

So if you have a `.dvc/cache` directory that already contains everything you need, there is nothing to download. If it contains only some of the things you need, there is less to pull over the network and less waiting to do.
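In other words, the amount of network work depends only on which hashes are absent from the local cache. A small sketch of that partition, assuming the two-character directory layout shown earlier; `missing_from_cache` is a hypothetical helper, not something DVC exposes.

```python
from pathlib import Path


def missing_from_cache(md5s, cache_dir) -> list:
    """Return the hashes that have no entry under cache_dir and would need fetching."""
    cache = Path(cache_dir)
    # Path.exists() follows symlinks, so symlinked entries count as present.
    return [h for h in md5s if not (cache / h[:2] / h[2:]).exists()]
```

With a fully populated cache the result is an empty list, which corresponds to the "skip all downloads" case above.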

## What if you symlink the files to someplace else?

Key observation: **DVC still works fine when the local cache files are symlinked to somewhere else on disk**.

I created a `dvc-data-files` directory on my `Desktop` and moved the files in the cache directory there (the commands below are run from inside `.dvc/cache`).

```
$ mkdir ~/Desktop/dvc-data-files/f3/
$ mv f3/eaeafb90cde88b238ebc8dfd4501c5 ~/Desktop/dvc-data-files/f3/eaeafb90cde88b238ebc8dfd4501c5
$ mkdir ~/Desktop/dvc-data-files/a3/
$ mv a3/5759d77c0a3dadb4d4253ff87ec430 ~/Desktop/dvc-data-files/a3/5759d77c0a3dadb4d4253ff87ec430
$ mkdir ~/Desktop/dvc-data-files/e5/
$ mv e5/02cb16c02f3572bb68adadcfc50463 ~/Desktop/dvc-data-files/e5/02cb16c02f3572bb68adadcfc50463
```

Then I replaced the original files with symlinks to the new file locations:

```
$ ln -sf /Users/alekseybilogur/Desktop/dvc-data-files/e5/02cb16c02f3572bb68adadcfc50463 e5/02cb16c02f3572bb68adadcfc50463
$ ln -sf /Users/alekseybilogur/Desktop/dvc-data-files/a3/5759d77c0a3dadb4d4253ff87ec430 a3/5759d77c0a3dadb4d4253ff87ec430
$ ln -sf /Users/alekseybilogur/Desktop/dvc-data-files/f3/eaeafb90cde88b238ebc8dfd4501c5 f3/eaeafb90cde88b238ebc8dfd4501c5
```

The cache entries have all been replaced with symlinks to files located somewhere else on disk. Does the cache still work?

```
$ dvc pull
No changes.
Everything is up to date.
```

```
$ dvc add train.csv
Stage is cached, skipping.
  0% Add|                                                                                      |0/0 [00:00,     ?file/s]
```

Looks like it does!
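The same move-then-symlink dance can be scripted, which is handy for repeating the experiment with more files. A sketch under the same assumptions as the shell commands above; `relocate_entry` is a hypothetical helper name, and it needs a filesystem that supports symlinks.

```python
import os
import shutil
from pathlib import Path


def relocate_entry(cache_dir, md5_hex: str, new_root) -> Path:
    """Move one cache entry under new_root and leave a symlink in its place."""
    rel = Path(md5_hex[:2]) / md5_hex[2:]
    src = Path(cache_dir) / rel
    dst = Path(new_root).resolve() / rel
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    # Absolute link target, mirroring the `ln -sf /Users/...` calls above.
    os.symlink(dst, src)
    return dst
```

After the call, reading the original cache path returns the same bytes as before, just via the link.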

## Basic idea of how this would work on Spell

Spell users mount code into the environment using our `git` and `--github-url` integrations, and they mount data into the environment using the `--mount` flag.

What if we inspected the git repo, looking for a `.dvc` directory in the root of the repository? If it exists, we can look at its `config` file to determine whether a remote is configured.

If it is, and the user has mounted files from that remote into the run, we're in business! Crawl the repo for `*.dvc` files, extract the CAS addresses of those files, and subset that list to just those files included in the user mounts. Create the `.dvc/cache` directory and populate it with symlinks to the mount paths inside of the run. Place those files ahead of all other files in the goofyscache pull order, so that they get downloaded and added into the image ahead of everything else.
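A rough sketch of the cache-population step, to show how little machinery it needs. It assumes the mount exposes the remote's own `XX/REMAINDER` layout (so a hash maps directly to a path under the mount) and that the list of hashes has already been extracted from the `*.dvc` files. `link_cache_to_mount` and its parameters are hypothetical; nothing here is an existing Spell or DVC API, and the pull-order part is not covered.

```python
from pathlib import Path


def link_cache_to_mount(repo_root, mount_root, md5s) -> list:
    """Symlink .dvc/cache entries to mounted files; return the hashes that were linked."""
    cache = Path(repo_root) / ".dvc" / "cache"
    linked = []
    for h in md5s:
        target = Path(mount_root) / h[:2] / h[2:]
        if not target.is_file():
            continue  # not part of the user's mounts, so skip it
        link = cache / h[:2] / h[2:]
        link.parent.mkdir(parents=True, exist_ok=True)
        # Leave any existing entry (file or symlink) untouched.
        if not (link.exists() or link.is_symlink()):
            link.symlink_to(target.resolve())
        linked.append(h)
    return linked
```

Any hash that is not returned would still have to come from the remote, which matches the partial-cache behaviour observed in the experiments above.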