# Data Version Control and Experiment Tracking with DVC and DagsHub

In this tutorial, we will take a fraud detection model we have built with transaction data and learn how to version our dataset and track our experiments. We will use two tools to do this, DVC and DagsHub.

Why do we need to version our data? Simply put, reproducing an experiment requires the exact data it was run on. If we don't version our data, we can't guarantee that our experiments will be reproducible.

Prerequisites:
- Install Docker
- Install Python 3.8+
- Install JupyterLab
- Install Git
- Create a GitHub account

By the end of this tutorial you will be able to:
- Set up DVC for version controlling datasets and models
- Link your GitHub repository to DagsHub
- Use DagsHub to track your experiments
  
You should download the data required for this tutorial from [here](https://drive.google.com/file/d/1MidRYkLdAV-i0qytvsflIcKitK4atiAd/view?usp=sharing). This is originally from a [Kaggle dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data) for Fraud Detection. Place this dataset in a `data` directory in the root of your project. You can run this notebook either in VS Code or Jupyter Notebooks.

We will need a number of libraries for this tutorial so boot up a terminal and install them before you proceed.

```bash
pip install -r requirements.txt
```

## Data Versioning with DVC

DVC is a version control system for datasets and models. You can think of it as Git for the large data files and model files that Git itself handles poorly.

Since you have forked this repo on GitHub, Git version control is already active. You can now use DVC to version your data and models. Initialise DVC, and commit the changes DVC made to Git.
    
```bash
dvc init
git commit -m "Initialise DVC"
```

Our Git repo is now a DVC repo too! Let's add the `data` directory we created to DVC.

```bash
dvc add data
git add data.dvc .gitignore
git commit -m "Add data directory to DVC"
```
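
`dvc add` keeps the data itself out of Git and instead commits a small `data.dvc` pointer file that records a content hash of what was added (MD5 by default). The sketch below illustrates the general idea of content-hash fingerprinting with the standard library only. The helper names `file_md5` and `directory_fingerprint` are made up for this illustration, and this is not DVC's actual implementation.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_fingerprint(root: Path) -> Dict[str, str]:
    """Map each file under root (by relative path) to its content hash."""
    return {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Comparing `directory_fingerprint(Path("data"))` before and after a change shows which files differ by content, which is the property that lets a small pointer file stand in for a large dataset.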

Now push your changes to GitHub!

```bash
git push -u origin master
```

That's how you can version your data and models with DVC! We now want a central remote repository that can host DVC repos. Enter DagsHub.

## DagsHub Setup

Navigate to [DagsHub](https://dagshub.com) and sign up for a free account. You can log in with your GitHub account.

You should be greeted with the following screen. Click the **Connect** button and select **GitHub**.

![DagsHub Entry](media/dags_entry.png)

You will be prompted to connect a GitHub repository. Search for your fork of this repo and connect it! You'll be greeted with a familiar-looking screen.

![DagsHub Repo](media/dags_repo.png)

If we want our data to be viewable in DagsHub, we need to add our dataset to DVC and set the DVC remote to our DagsHub repo.

```bash
dvc remote add origin --local https://dagshub.com/<username>/<repo_name>.dvc
```

Now we need to tell DVC how to authenticate.

```bash
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <username>
dvc remote modify origin --local ask_password true
```

Before you push your DVC repo, navigate to your DagsHub settings and create a *password* if you do not have one set. Then push! (It may take a while.)

```bash
dvc push -r origin
```

You can now view your data in DagsHub. Our file is a little too big to view in the dashboard, but you can view the raw file to verify it is working.

To show you how this updates, let's add some more files to the data directory. Run the `preprocess.py` script to generate the train-test splits of the data that we will use for training and testing.

```bash
python preprocess.py
```

This should generate four new files: `X_train.csv`, `X_test.csv`, `y_train.csv`, and `y_test.csv`. Add them to the DVC repo, commit and push.

```bash
dvc add data
git add data.dvc
git commit -m "Add generated preprocessed dataframes"
git push -u origin master
dvc push -r origin
```
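
If the new files do not show up in DagsHub, a quick local check can rule out the preprocessing step. The sketch below assumes the four files are written directly into the `data` directory; the helper name `missing_split_files` is hypothetical and not part of this repo.

```python
from pathlib import Path
from typing import List

# File names taken from the preprocessing step described above.
EXPECTED_FILES = ["X_train.csv", "X_test.csv", "y_train.csv", "y_test.csv"]


def missing_split_files(data_dir: str = "data") -> List[str]:
    """Return the expected split files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_split_files()` means all four splits are in place, so `dvc add data` had them available to hash and `dvc push` had them available to upload.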


## Experiment Tracking

DagsHub not only provides a remote DVC store, it also lets you track your experiments alongside your versioned code and data. We are going to modify the supplied file, `train.py`, to show how this is done. At the top of the file, add the `dagshub` import:

```python
...
from xgboost import XGBClassifier
import dagshub
...
```

Now, we are going to wrap our training code in a `dagshub_logger` context manager. The logger has two methods, `log_hyperparams` and `log_metrics`, for logging hyperparameters and metrics respectively (duh). Indent the `XGBClassifier` and `.fit()` lines by one level and wrap them in a `with dagshub.dagshub_logger()` block, like so:

```python
...
with dagshub.dagshub_logger() as logger:
    xgb = XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="binary:logistic",
        nthread=4,
        scale_pos_weight=1,
        seed=27,
    )

    # Log your hyperparameters with DagsHub
    logger.log_hyperparams(model_class=type(xgb).__name__)
    logger.log_hyperparams({'model': xgb.get_params()})

    model = xgb.fit(X_train, y_train)
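    y_pred = xgb.predict(X_test)
    # ROC AUC is computed from positive-class probabilities, not hard labels
    y_proba = xgb.predict_proba(X_test)[:, 1]

    accuracy = round(accuracy_score(y_test, y_pred), 3)
    roc_auc = round(roc_auc_score(y_test, y_proba), 3)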
    
    # Log your metrics with DagsHub
    logger.log_metrics(
        {'accuracy': accuracy}
    )
    logger.log_metrics(
        {'roc_auc': roc_auc}
    )
...
```

Save the changes and commit them with Git. Time to run an experiment! Run your modified training file:

```bash
python train.py
```

This creates three new files: the two DagsHub files, `metrics.csv` and `params.yml`, and our model file, `models/xgb-fraud-classifier.joblib`. Add the DagsHub files to the Git repo, add the model file to DVC, and commit. It is also a good idea to tag your commit with some sort of model version.

```bash
dvc add models
git add metrics.csv params.yml models.dvc .gitignore
git commit -m "Train XGBoost model with OHE features, v0.1"
git tag -a "v0.1" -m "xgb model v0.1"
git push -u origin master
dvc push -r origin
```

You can now view your experiments in DagsHub under the **Experiments** tab.

![DagsHub Experiments](media/dags_experiments.png)
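
The same numbers can also be inspected locally, since `metrics.csv` is a plain CSV file. The sketch below assumes the file has `Name` and `Value` columns (alongside `Timestamp` and `Step`), so check the header of your own file before relying on it. The helper name `latest_metrics` is hypothetical.

```python
import csv
from typing import Dict


def latest_metrics(path: str = "metrics.csv") -> Dict[str, float]:
    """Return the most recently logged value for each metric name."""
    latest: Dict[str, float] = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Later rows overwrite earlier ones, so the last logged value wins.
            latest[row["Name"]] = float(row["Value"])
    return latest
```

Calling `latest_metrics()` from the project root after a training run returns a dictionary keyed by metric name, which is handy for a quick comparison between two tagged commits.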
