# 🔐 Backup and version Argilla `Datasets` using `DVC`


In this tutorial, we will show you how to store and version your data using [DVC](https://dvc.org/). Alternatively, you can take a look at our [Elasticsearch docs](../../getting_started/installation/elasticsearch.md) about creating retention snapshots directly from your Elasticsearch cluster. The tutorial will walk you through the following steps:

- ⚙️ configure DVC
- 🧐 determine backup config
- 🧪 test backup config

<img src="../../_static/tutorials/deploying-text2text-dvc-explainability/deploying-text2text-dvc-explainability.gif" alt="Transformers Log Demo" style="width: 1100px;">

## Introduction

It is important to keep track of and store the data used in training cycles, both to version it and to avoid losing it altogether. DVC creates a reference to your data and stores the data itself in an external storage repo. Pushing this reference to Git allows us to reproduce certain stages of the repository, while also keeping a copy of the exact data that was in the repo at that time. Think "git for data".

Take a look at the [DVC docs](https://dvc.org/doc/start/data-management/data-versioning) to get a bit more familiar with the idea behind this versioning principle.
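
To make the idea of a "reference" a bit more concrete: a `.dvc` file is a small text file that records, among other things, a content hash of the data it points to, so the small reference can live in Git while the large data lives in remote storage. The sketch below is not DVC's implementation; it only illustrates content hashing with Python's `hashlib`. The helper name `file_md5` is hypothetical, and the exact hashing details DVC uses may differ between versions.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Identical content always yields the same digest, so a changed dataset results in a changed reference, and that small change is what gets committed to Git.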



Let's get started!

## Setup

Apart from Argilla, we'll need to [install DVC](https://dvc.org/doc/install).

In [None]:
# run only the line that matches your operating system
!brew install dvc # mac
!snap install --classic dvc # linux
!choco install dvc # windows

## Configure DVC

We assume that DVC will be used in combination with Google Drive as remote storage. This needs to be configured by adding a remote similar to the one shown below, where `<your-gdrive-folder-id>` is replaced with the ID of the Google Drive folder you would like to use for storage. Alternatively, you can go to the DVC [configuration page](https://dvc.org/doc/user-guide/how-to/setup-google-drive-remote) for more details.

In [None]:
# -d marks this remote as the default, so a plain `dvc push` knows where to upload
!dvc remote add -d myremote gdrive://<your-gdrive-folder-id>

## Configure `.git`

We will use GitHub to track the `.dvc` references to our stored files.
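
The tutorial does not spell out the Git setup, so the following is only a rough sketch of what it could look like, assuming an empty working folder and an existing, empty GitHub repository. The helper names `setup_commands` and `run_setup` and the example URL are hypothetical; the `git` and `dvc` subcommands themselves are standard. Note that `dvc init` expects to run inside a Git repository.

```python
import subprocess
from typing import List


def setup_commands(github_url: str) -> List[List[str]]:
    """Commands that prepare a folder for Git + DVC tracking (assumed workflow)."""
    return [
        ["git", "init"],                                  # create the Git repository
        ["dvc", "init"],                                  # create the .dvc/ folder and .dvcignore
        ["git", "add", ".dvc", ".dvcignore"],             # stage DVC's own config files
        ["git", "commit", "-m", "initialize DVC"],        # record the DVC setup in Git
        ["git", "remote", "add", "origin", github_url],   # point Git at the GitHub repo
    ]


def run_setup(github_url: str, dry_run: bool = True) -> None:
    """Print the setup commands and, unless dry_run is set, execute them."""
    for command in setup_commands(github_url):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# dry run: only prints the commands, nothing is executed
run_setup("https://github.com/<your-user>/<your-repo>.git")
```

Keeping `dry_run=True` by default makes it easy to review the commands before running them against a real repository.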

## Define Background Process

After setting up DVC, we can define a background process, for example as an [Argilla Listener](https://docs.argilla.io/en/latest/guides/features/listeners.html). It will go through the following steps:
-   Export the data using the naming convention `/data/YY-mm-dd_dataset`
    -   (optional) create `/data_descriptions` to add to GitHub
-   Add the data to DVC, creating a `.dvc` reference to `/data/*`
-   Commit the `.dvc` reference to Git
-   Push `/data/*` to the DVC remote and push the `.dvc` reference to GitHub
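
The naming convention from the first step can be sketched as a small helper. This is an illustration rather than part of Argilla or DVC: `export_path` is a hypothetical name, and the `.pkl` extension is an assumption taken from the pickle export used in this tutorial.

```python
import datetime


def export_path(dataset_name: str, day: datetime.date, folder: str = "data") -> str:
    """Build a date-stamped export path following the YY-mm-dd_dataset convention."""
    stamp = day.strftime("%y-%m-%d")  # two-digit year, month, day
    return f"{folder}/{stamp}_{dataset_name}.pkl"


print(export_path("my_dataset", datetime.date(2023, 1, 31)))
# prints: data/23-01-31_my_dataset.pkl
```

Date-stamped names create one file per day. Writing to a fixed file name instead and relying on the Git history of the `.dvc` reference is an equally valid choice; which one fits better depends on how you want to browse old versions.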

This kind of versioning allows us to explore data in GitHub by using `git checkout` first (to switch branches or check out a version of a `.dvc` file) and then running `dvc checkout` to sync the data.
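
The two-step restore described above could be scripted roughly as follows. The helper names `restore_commands` and `restore` are hypothetical, and `revision` can be any Git branch, tag, or commit hash.

```python
import subprocess
from typing import List


def restore_commands(revision: str) -> List[List[str]]:
    """Commands that bring the workspace back to the data of a given Git revision."""
    return [
        ["git", "checkout", revision],  # restore the .dvc references of that revision
        ["dvc", "checkout"],            # sync the data files with those references
    ]


def restore(revision: str, dry_run: bool = True) -> None:
    """Print the restore commands and, unless dry_run is set, execute them."""
    for command in restore_commands(revision):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# dry run: only prints the commands
restore("main")
```

If the requested data version is not in the local DVC cache yet, `dvc checkout` cannot restore it; in that case `dvc pull` downloads it from the remote first.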

In [None]:
import glob
import os
import subprocess
import time
from typing import List

import argilla as rg

rg.init(api_url=os.environ.get("ARGILLA_API_URL_DEV"), api_key=os.environ.get("ARGILLA_API_KEY"))

def dataset_backupper(datasets: List[str], duration: int=60*60*24):
    """Back up the given datasets every `duration` seconds (default: once a day)."""
    # make sure the export folder exists before writing to it
    os.makedirs("data", exist_ok=True)

    while True:
        # load each dataset and save it as a .pkl file
        for dataset_name in datasets:
            ds = rg.load(dataset_name)
            df = ds.to_pandas()
            df.to_pickle(f"data/{dataset_name}.pkl")

        # get all .pkl files using glob and track them with DVC
        files = glob.glob("data/*.pkl")
        for file in files:
            subprocess.run(["dvc", "add", file])

        # stage and commit the .pkl.dvc references
        # (git commit exits with a non-zero code when nothing changed, so we do not check it)
        for file in files:
            subprocess.run(["git", "add", f"{file}.dvc"])
            subprocess.run(["git", "commit", f"{file}.dvc", "-m", f"updated {file}"])

        # push the data to the DVC remote and the .dvc references to GitHub
        subprocess.run(["dvc", "push"])
        subprocess.run(["git", "push"])

        # wait until the next backup cycle
        time.sleep(duration)
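
Because the `while True` loop never returns, calling the backup function directly blocks the notebook. One possible way to keep it out of the way is to run a periodic task in a daemon thread that can be stopped again. The sketch below uses only the standard library; `run_periodically` and `start_background` are hypothetical helper names, and `task` stands in for a single backup pass (one iteration of the loop above without the final sleep).

```python
import threading
import time
from typing import Callable, Tuple


def run_periodically(task: Callable[[], None], interval: float, stop: threading.Event) -> None:
    """Call task, wait interval seconds, and repeat until stop is set."""
    while not stop.is_set():
        task()
        # wait() returns early once stop is set, so shutdown is not delayed
        stop.wait(interval)


def start_background(task: Callable[[], None], interval: float) -> Tuple[threading.Thread, threading.Event]:
    """Start task in a daemon thread and return the thread plus its stop signal."""
    stop = threading.Event()
    thread = threading.Thread(target=run_periodically, args=(task, interval, stop), daemon=True)
    thread.start()
    return thread, stop


# usage with a stand-in task instead of a real backup
thread, stop = start_background(lambda: print("backup pass"), interval=0.05)
time.sleep(0.2)
stop.set()       # ask the loop to finish
thread.join(1)   # wait for the thread to exit
```

A daemon thread ends together with the Python process, so for backups that must survive a notebook restart, a scheduler outside the notebook (for example a cron job) may be the more robust option.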

## Summary

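In this tutorial, we learned how to back up and version Argilla datasets using DVC, with Google Drive as remote storage and GitHub tracking the `.dvc` references.
This can help to reproduce the exact data used in a training cycle and to avoid losing data altogether.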

## Next steps

⭐ Star the Argilla [GitHub repo](https://github.com/argilla-io/argilla) to stay updated.

📚 [Argilla documentation](https://docs.argilla.io) for more guides and tutorials.

🙋‍♀️ Join the Argilla community! A good place to start is the [discussion forum](https://github.com/argilla-io/argilla/discussions).