# Data and Model Versioning



## Initialise the Project

In [2]:
!dvc init --subdir

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

## Set Up a Remote Artefact Location

For this demo, the remote is an AWS S3 bucket.

In [1]:
!dvc remote add -d s3 s3://dvc-example-artefacts

Setting 's3' as a default remote.

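Note that the `-d` flag sets this remote as the default, so DVC commands that talk to remote storage, such as `dvc push` and `dvc pull`, use it unless another remote is named explicitly. `dvc add` only works with the local cache and does not contact the remote.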
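
To see what "default remote" means in practice, it can help to look at the configuration that DVC writes. The sketch below parses an assumed copy of `.dvc/config` with Python's standard `configparser`. The sample text and the helper name `default_remote_url` are illustrative assumptions, not captured output, so compare them with the actual file in your project.

```python
import configparser

# Assumed contents of .dvc/config after the `dvc remote add -d` command above.
# This text is an illustration, not output captured from the demo project.
SAMPLE_CONFIG = """\
[core]
    remote = s3
['remote "s3"']
    url = s3://dvc-example-artefacts
"""


def default_remote_url(config_text: str) -> str:
    """Return the URL of the default remote named in a DVC-style config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The [core] section names the default remote.
    name = parser["core"]["remote"]
    # Remote sections are named like 'remote "<name>"', quotes included.
    section = f"'remote \"{name}\"'"
    return parser[section]["url"]


print(default_remote_url(SAMPLE_CONFIG))
```

Running `!cat .dvc/config` in the notebook is the quickest way to check the real file against this sketch.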

## Start Tracking a Dataset

We start by creating a dataset with pandas. This stands in for any data ingestion operation, e.g., querying a database to retrieve the latest tranche of training data.

In [9]:
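from pathlib import Path

import pandas as pd


Path("datasets").mkdir(exist_ok=True)  # ensure the target directory exists
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "z": ["a", "b", "c", "d", "e"]})
df.to_csv("datasets/example.csv")  # the row index is written as the first column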

Next, we get DVC to start tracking this new dataset.

In [10]:
!dvc add datasets/example.csv

To track the changes with git, run:

    git add datasets/.gitignore datasets/example.csv.dvc

To enable auto staging, run:

    dvc config core.autostage true

Note that Git will not track `datasets/example.csv`, because `dvc add` automatically adds it to a `.gitignore` file in the same directory:

In [15]:
!cat datasets/.gitignore

/example.csv


Instead, we track the metadata file `datasets/example.csv.dvc` with Git and use `dvc push` to copy the data itself to remote storage (see below).

In [16]:
!cat datasets/example.csv.dvc

outs:
- md5: 553afb5628d5a62daecac40d8442f189
  size: 35
  path: example.csv
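
The `md5` and `size` fields are how DVC recognises a specific version of the file. As a minimal sketch of how they could be checked by hand, the helper below (the name `md5_and_size` is hypothetical) hashes a file's raw bytes with the standard library. The digest may not match in every DVC version, since some versions are understood to normalise line endings in text files before hashing.

```python
import hashlib
from pathlib import Path


def md5_and_size(path: str) -> tuple[str, int]:
    """Return the hex MD5 digest and the size in bytes of the file at `path`."""
    data = Path(path).read_bytes()
    return hashlib.md5(data).hexdigest(), len(data)


# Only runs the check if the dataset from this notebook is present.
target = Path("datasets/example.csv")
if target.exists():
    digest, size = md5_and_size(str(target))
    print(f"md5: {digest}")
    print(f"size: {size}")
```

If the printed values agree with the `.dvc` file, the local copy is the version that DVC recorded.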


## Push Dataset to S3

In [14]:
!dvc push datasets/example.csv

1 file pushed
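
DVC stores pushed data under a content-addressed name derived from the `md5` recorded in `datasets/example.csv.dvc`, not under the original filename. The sketch below assumes the common layout in which the first two hex characters form a directory and the remaining characters form the object name. The function name `object_key` and its `prefix` parameter are hypothetical. Some DVC releases are understood to add a prefix such as `files/md5` in front of this path, so check your bucket before relying on it.

```python
def object_key(md5: str, prefix: str = "") -> str:
    """Map a 32-character hex MD5 digest to '<2 chars>/<30 chars>'.

    `prefix` is an optional leading path segment (an assumption, see above).
    """
    if len(md5) != 32:
        raise ValueError("expected a 32-character hex MD5 digest")
    key = f"{md5[:2]}/{md5[2:]}"
    return f"{prefix.rstrip('/')}/{key}" if prefix else key


# Digest taken from datasets/example.csv.dvc above.
print(object_key("553afb5628d5a62daecac40d8442f189"))
# prints "55/3afb5628d5a62daecac40d8442f189"
```

Under that assumption, the object for this dataset would sit at `s3://dvc-example-artefacts/55/3afb5628d5a62daecac40d8442f189`, possibly behind a version-specific prefix.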