# Data Version Control and Experiment Tracking with DVC and DagsHub!

In this tutorial, we will take a fraud detection model built on transaction data and learn how to version its dataset and track our experiments. We will use two tools to do this: DVC and DagsHub.

Why do we need to version our data? Simply put, an experiment is only reproducible if we can recover the exact data that produced it. If the dataset changes and we have no record of which version a run used, we can't guarantee that rerunning it will give the same result.

Prerequisites:
- Install Docker
- Install Python 3.8+
- Install JupyterLab

By the end of this tutorial you will be able to:
- Set up DVC for version controlling datasets and models
- Link your GitHub repository to DagsHub
- Use DagsHub to track your experiments
  
Download the data required for this tutorial from [here](https://drive.google.com/file/d/1MidRYkLdAV-i0qytvsflIcKitK4atiAd/view?usp=sharing). It originally comes from a [Kaggle dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data) for fraud detection. Place the dataset in a `data` directory in the root of your project; the code below reads it from `data/train_transaction.csv`. You can run this notebook in either VS Code or Jupyter.
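Before running anything, it can help to confirm that the file landed where the notebook expects it. The following is a minimal sketch, assuming the CSV is named `train_transaction.csv` (the name the training code reads from `data/`); the helper name `find_dataset` is made up for illustration.

```python
from pathlib import Path


def find_dataset(project_root=".", filename="train_transaction.csv"):
    """Return the dataset path under <project_root>/data, or raise if it is missing."""
    path = Path(project_root) / "data" / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at {path}; download it and place it in the data directory."
        )
    return path
```

Calling `find_dataset()` from the project root returns the path if the file is present and raises a `FileNotFoundError` otherwise.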

## Build a model

First, let's build our initial model. It depends on a few libraries, so install them before training.

In [None]:
!pip install numpy pandas xgboost scikit-learn


In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

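# Load the data and downsample the non-fraud class so both classes are the same size
df = pd.read_csv("data/train_transaction.csv")
n_fraud = int((df.isFraud == 1).sum())
df = pd.concat(
    # Fixed random_state so the same non-fraud rows are sampled on every run
    [df[df.isFraud == 0].sample(n=n_fraud, random_state=42), df[df.isFraud == 1]],
    axis=0,
)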

# Select the features and target
X = df[["ProductCD", "P_emaildomain", "R_emaildomain", "card4", "M1", "M2", "M3"]]
y = df.isFraud

# Use one-hot encoding to encode the categorical features
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)

X = pd.DataFrame(
    enc.transform(X).toarray(), columns=enc.get_feature_names_out()
)
# Add the numeric amount by position; X has a fresh index, so avoid index alignment
X["TransactionAmt"] = df["TransactionAmt"].to_numpy()

# Split the dataset and train the model
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    nthread=4,
    scale_pos_weight=1,
    seed=27,
)
model = xgb.fit(X_train, y_train)


Done! But now, how do we keep track of this specific experiment? Enter DVC and DagsHub.
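Keeping track of an experiment comes down to recording what went in (data version, hyperparameters) and what came out (metrics). As a tool-agnostic sketch of that idea, and not how DVC or DagsHub store runs, the snippet below computes accuracy and writes a small JSON record. The function names and the `run_record.json` file name are hypothetical.

```python
import json
from pathlib import Path


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = list(y_true), list(y_pred)
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def save_run_record(params, metrics, path="run_record.json"):
    """Write the hyperparameters and metrics of one run to a JSON file."""
    record = {"params": params, "metrics": metrics}
    Path(path).write_text(json.dumps(record, indent=2))
    return record


# Hypothetical usage with the model trained above:
#   acc = accuracy(y_test, model.predict(X_test))
#   save_run_record({"n_estimators": 100, "max_depth": 3}, {"accuracy": acc})
```

A hand-written record like this quickly becomes hard to manage across many runs and dataset versions, which is the gap the tools in the rest of this tutorial are meant to fill.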

## Set up DVC

DVC is a version control system for datasets and models. You can think of it as Git for data: it lets you version large datasets and model files that would be impractical to commit to a Git repository directly.
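The core idea is that DVC identifies each tracked file by a hash of its contents and commits only a small pointer file to Git, while the data itself lives in a separate cache or remote storage. The sketch below illustrates the hashing part in plain Python; it is a simplified illustration, not DVC's implementation, and `file_md5` is a hypothetical helper.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two copies of a dataset with identical bytes produce the same digest, and changing even one byte produces a different one, which is what makes a stored hash usable as a version identifier.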

Install DVC with pip:

In [None]:
!pip install dvc