# Building machine learning pipelines and tracking experiments with DVC in VSCode
## Manage ML experiments like a pro


### Why track experiments?

### What you will learn in this tutorial

### Setting up the project and downloading a dataset

You start by forking [the following GitHub repository](https://github.com/BexTuychiev/dvc-tutorial) and cloning it:
```
$ git clone https://github.com/YourUsername/dvc-tutorial.git
$ cd dvc-tutorial
```

Don't forget to replace `YourUsername` with your GitHub username. The cloned repository has three Python scripts inside `src` (more on them later), a pre-filled `.gitignore` file for Python projects and a `requirements.txt` file. Next, create a conda environment and install the dependencies:

```
$ conda create -n dvc-tutorial python==3.9 -y
$ conda activate dvc-tutorial
$ pip install -r requirements.txt
```

Next, create a `data` directory to store the `raw` images. You will be using The German Traffic Signs Recognition Benchmark (GTSRB) dataset. You can either download it from its [homepage on Kaggle](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign) or use this [direct download link](https://storage.googleapis.com/kaggle-data-sets/82373/191501/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20221217%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221217T125828Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=99f991f43b97aaf3c4e32ddb137d2373c8f2389af52063bc47ad2be4d4df1177f2b1438715b0f91c593cc8e905fe712b4b562314f46df619f99446dbb2673c11f5afd57ca4130d8de9e5df83ac6358698c5677f965e00914ea1b90d81fa130fecca88bbfe612bb71e67c14f8c34f0e94b2c1ffc4133918078aad039cb04dd786aac96d0a8522dffbeab9e8991ea775e1d3c189c29362c456b4ab1427b5d169d1b60b9d76f79586c5544ce7fe7adfd705401ddcd516a57af9c5bd58fe1690fc03187c82cba47ea393d160d1802f66292d9f55a20d1587c7079083cdb91f289ab8ca886a90e99b9bfa8b6c159b64b8fe4c7229a28dcd7a4ddb84e3c5e175aa63c0) below:

> I recommend downloading the zipped dataset from the official webpage, as the download link might change.

```bash
$ mkdir data data/raw
$ curl "the_link_inside_quotes" -o data/traffic_signs.zip
$ unzip data/traffic_signs.zip -d data/raw
$ cd data/raw
```

The zipped dataset comes with a few unnecessary files and directories, which you will delete, along with the original zipfile:
```bash
$ rm -rf Train Test test Meta meta Meta.csv Test.csv Train.csv
$ cd ../..
$ rm data/traffic_signs.zip
```

Your directory structure should now look like this:

```
$ tree -L 3
├── data
│   └── raw
│       └── train
├── requirements.txt
└── src
    ├── preprocess.py
    └── train.py
```

Next, you initialize DVC and start tracking the `data/raw/train` directory:

```
$ dvc init
$ dvc add data/raw/train
$ git add .
$ git commit -m "Download and add a dataset"
```

The commit also records the files that `dvc init` creates, so DVC initialization becomes part of the Git history.

> I explained the fundamentals of DVC and its `init`, `add`, `remote`, `push` commands in the [first part of the article](https://medium.com/towards-data-science/how-to-version-gigabyte-sized-datasets-just-like-code-with-dvc-in-python-5197662e85bd).

The only thing missing is remote storage, which can be any directory on your system or cloud storage such as an S3 bucket. I recommend a local remote (like `~/dvc_remote`) for this tutorial:

```
# Create a directory for the remote under `home`
$ mkdir ~/dvc_remote
$ dvc remote add -d dvc_remote ~/dvc_remote
```

Now, you should push the Git commits to GitHub and the cached train images to the DVC remote:
```bash
$ git push
$ dvc push
```

### What is a machine learning pipeline?

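A machine learning pipeline is a sequence of stages, such as splitting the data, preprocessing it, training a model, and evaluating it. Each stage runs a command, depends on certain files, and produces outputs that later stages consume. Defining the stages explicitly makes it possible to rerun the whole workflow, or only the parts affected by a change, with a single command.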

### How to create a pipeline in DVC?

The first stage of our pipeline is to set aside 10% of training images for testing. Currently, the `data/raw/train` directory contains 43 classes of traffic signs, with a varying number of images in each.

```
├── data
│   └── raw
│       ├── train
│       │   ├── 0
│       │   ├── 1
│       │   ├── ...
│       └── train.dvc
├── notebooks
│   └── test.ipynb
├── requirements.txt
└── src
    ├── preprocess.py
    ├── split.py
    └── train.py
```

The `src/split.py` script, specifically lines 12-26, takes 10% of the images (after shuffling) in each class directory and moves them to a mirror directory at `data/raw/test/class_number`. The script uses a combination of the `shutil` and `pathlib` libraries:

```python
import shutil
from pathlib import Path
import numpy as np

np.random.seed(42)

base_dir = Path(__file__).parent.parent
raw_train_dir = base_dir / "data" / "raw" / "train"
raw_test_dir = base_dir / "data" / "raw" / "test"

# Move 10% of train images to the test directory
for directory in raw_train_dir.iterdir():
    # Mirror the class directory name under data/raw/test
    test_mirror_path = raw_test_dir / directory.name
    test_mirror_path.mkdir(parents=True, exist_ok=True)

    # Collect image paths in each class of train directory
    image_paths = list(directory.glob("*.png"))
    np.random.shuffle(image_paths)

    # Choose 10% of images
    test_images = image_paths[-int(len(image_paths) * 0.1):]

    # Move the chosen images to the test directory
    for image_path in test_images:
        shutil.move(image_path, test_mirror_path)

```
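One edge case is worth knowing about when selecting a fraction with a negative slice: if a class holds fewer than ten images, `int(len(image_paths) * 0.1)` is `0`, and `image_paths[-0:]` selects the whole list. It is unlikely to matter when every class has plenty of images, but the selection can be made explicit. The sketch below is a hypothetical variant, not the code in `src/split.py`; the name `choose_test_subset` and its parameters are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def choose_test_subset(
    items: Sequence[str], fraction: float = 0.1, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Return (train_items, test_items) after a seeded shuffle.

    Hypothetical helper for illustration; not part of the tutorial repository.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    # Sort first so the result does not depend on directory listing order
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * fraction)
    # Slicing from the front avoids the `[-0:]` pitfall when n_test is 0
    return shuffled[n_test:], shuffled[:n_test]


train_items, test_items = choose_test_subset([f"img_{i}.png" for i in range(25)])
print(len(train_items), len(test_items))  # 23 2
```

With three items, the same call returns an empty test list instead of moving every image.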

Instead of running the script with `python src/split.py`, we will run it as a pipeline stage with the following multi-line CLI command:

```bash
$ dvc stage add -n split \
                -d src/split.py -d data/raw/train \
                -o data/raw/test \
                python src/split.py
```

Let's understand the command line-by-line. The `stage add` command adds a step to a DVC pipeline, whose name you specify after the `-n` tag. We are calling this stage `split`. 

The next line of the command specifies two dependencies with `-d` tags. The `split` stage needs both the `data/raw/train` directory and `src/split.py` to run without errors, so they are dependencies. `split.py` also moves the images to a new `data/raw/test` directory, so it is given as an output with the `-o` tag.

The final line is the actual shell command that runs the pipeline step, which is `python src/split.py`.

You could have added the stage with the following more explicit syntax:

```bash
$ dvc stage add --name split \
                --deps src/split.py \
                --deps data/raw/train \
                --outs data/raw/test \
                --desc "Set aside 10% of training images for testing." \
                python src/split.py
```

> It is important to give each dependency and output its own `-d` or `-o` tag.

When you run the above `stage add` command, you will get the following output:

```
$ dvc stage add ...
Creating 'dvc.yaml'                                                   
Adding stage 'split' in 'dvc.yaml'
```

The output tells you that a new `dvc.yaml` file was created. DVC configures your entire pipeline inside it. When you open it, you will see the `split` stage added to it:

```YAML
$ cat dvc.yaml
stages:
  split:
    cmd: python src/split.py
    deps:
    - data/raw/train
    - src/split.py
    outs:
    - data/raw/test
```

Now, you can already run this pipeline (even though it contains only one stage right now) with `dvc repro`, which stands for 'DVC reproduce':

```bash
$ dvc repro
Running stage 'split':                                                                                                                      
> python src/split.py
```

After execution, if you look at `data/raw`, you will see a new `test` folder with 10% of the images of each class:

```
├── data
│   └── raw
│       ├── test
│       │   ├── 0
│       │   ├── 1
│       │   └── ...
│       ├── train
│       │   ├── 0
│       │   ├── 1
│       │   └── ...
│       └── train.dvc
├── dvc.lock
├── dvc.yaml
├── requirements.txt
└── src
    ├── preprocess.py
    ├── split.py
    └── train.py
```
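If you want to confirm the split, a few lines of Python can count the images per class in both directories. This is a minimal sketch that is not part of the tutorial repository; the helper names `count_images` and `split_share` are assumptions, and it expects the `data/raw/train` and `data/raw/test` layout shown above.

```python
from pathlib import Path
from typing import Dict


def count_images(root: Path, pattern: str = "*.png") -> Dict[str, int]:
    """Map each class directory under `root` to its number of images."""
    return {
        class_dir.name: sum(1 for _ in class_dir.glob(pattern))
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }


def split_share(train_root: Path, test_root: Path) -> Dict[str, float]:
    """Fraction of each class's images that ended up in the test split."""
    train_counts = count_images(train_root)
    test_counts = count_images(test_root)
    shares = {}
    for name, n_train in train_counts.items():
        n_test = test_counts.get(name, 0)
        total = n_train + n_test
        shares[name] = n_test / total if total else 0.0
    return shares


if __name__ == "__main__":
    raw = Path("data") / "raw"
    # Only report when both directories exist, i.e. after `dvc repro` has run
    if (raw / "train").is_dir() and (raw / "test").is_dir():
        for name, share in split_share(raw / "train", raw / "test").items():
            print(f"class {name}: {share:.1%} in test")
```

Each class should report a share close to 10%, slightly lower where the class size is not a multiple of ten.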

> We will get to the `dvc.lock` file a bit later.

The `data/raw/test` directory is already under DVC control and in the cache, because specifying outputs with the `-o` tag in a stage automatically adds them to DVC. For this reason, `git status` will only show changes to `.gitignore` and a couple of untracked files:

```bash
$ git status -s
 M data/raw/.gitignore
?? dvc.lock
?? dvc.yaml
```

Before you commit these changes to Git, be sure to run `dvc add data/raw/train` again because it changed after we ran the pipeline:

```
$ dvc status
data/raw/train.dvc:    
        changed outs:
                modified:           data/raw/train
$ dvc add data/raw/train
```

Now, turn to Git:
```
$ git add --all
$ git commit -m "Add and run 'split' stage of a pipeline"
$ git push
$ dvc push
```

Let's add another stage called `preprocess`:

```bash
$ dvc stage add -n preprocess \
                -p preprocess.denoise_weight \
                -d data/raw/ -d src/preprocess.py \
                -o data/prepared \
                python src/preprocess.py
```

This `stage add` command has a new tag called `-p`, which specifies a stage parameter. Python scripts usually have parameters that change from run to run, and this is DVC's way of making the pipeline stages aware of them. The `-p` tag assumes you have a `params.yaml` file in the root directory when adding the stages. So, before running the above command, create `params.yaml` and paste in the following contents:

```
$ touch params.yaml
# Paste the contents
$ cat params.yaml
preprocess:
  denoise_weight: 0.2
```

The file lists a single stage called `preprocess` and a single `denoise_weight` parameter with a value of 0.2. For our `src/preprocess.py` script to read this parameter, in line 7, we import the `params_show` function from `dvc.api`:

```python
...
from dvc.api import params_show
```

Then, in line 48, under `if __name__ == "__main__"`, we read the parameters for the `preprocess` stage with `params = params_show()['preprocess']`, which returns a dictionary of parameters.

```python
if __name__ == "__main__":
    ...
    
    params = params_show()['preprocess']
    denoise_weight = params["denoise_weight"]
```

> The `params_show` function looks for a `params.yaml` file by default.

Then, `denoise_weight` is passed to the `denoise_image` function, which runs the [Total Variation filtering technique from `scikit-image`](https://scikit-image.org/docs/stable/api/skimage.restoration.html#skimage.restoration.denoise_tv_chambolle) to denoise all images in the train and test folders.

The `dvc.yaml` file now looks like below:
```YAML
stages:
  split:
    ...
  preprocess:
    cmd: python src/preprocess.py
    deps:
    - data/raw/
    - src/preprocess.py
    params:
    - preprocess.denoise_weight
    outs:
    - data/prepared
```

Let's run the entire pipeline with `dvc repro`:

```
$ dvc repro
'data/raw/train.dvc' didn't change, skipping
Stage 'split' didn't change, skipping
Running stage 'preprocess' with command: ...
```

This time we see the beauty of `repro`: it detects changes in each pipeline stage and reruns a stage only if its dependencies or outputs have changed. How? Courtesy of the `dvc.lock` file, which keeps track of the hashes of each dependency and output of a stage. Combined with `dvc.yaml`, it lets DVC detect a change in any stage and invalidate the subsequent stages so they run again, as all stages are connected via dependencies and outputs!
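To make the idea of hash-based change detection concrete, here is a minimal sketch using only the standard library. It is not DVC's implementation: the function names `file_hash`, `snapshot`, and `needs_rerun` are hypothetical, and SHA-256 is an arbitrary choice of hash function for the illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def file_hash(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths: Iterable[Path]) -> Dict[str, str]:
    """Record the current hash of every path (the role a lock file plays)."""
    return {str(p): file_hash(p) for p in paths}


def needs_rerun(paths: Iterable[Path], locked: Dict[str, str]) -> bool:
    """A stage is stale if its tracked files no longer match the snapshot."""
    return snapshot(paths) != locked
```

Taking a snapshot after a successful run and calling `needs_rerun` before the next one mirrors the skip-or-run decision shown in the `dvc repro` output above.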

Let's commit the changes to `params.yaml`, `dvc.yaml`, `dvc.lock` and the rest of the files to Git (`raw` and `prepared` are already inside DVC cache as they are outputs of pipeline stages).

```
$ git add .
$ git commit -m "Add and run 'preprocess' stage of a pipeline"
$ git push
$ dvc push
```

We will add one final `train` stage to the pipeline before going into evaluation:

```bash
$ dvc stage add -n train \
                -p train \
                -d data/prepared -d src/train.py \
                -o models -O metrics/metrics.csv \
                python src/train.py
```

Notice how we are only passing a single keyword to `-p` in the `train` stage. The reason is that in an updated `params.yaml` file, we have multiple parameters and it would be cumbersome to list them all with commas:

```
$ cat params.yaml
preprocess:
  denoise_weight: 0.2

train:
  image_width: 30
  image_height: 30
  batch_size: 32
  learning_rate: 0.1
  n_epochs: 5
```

That's why we are simply specifying a parameter group via the stage name only. 
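The difference between `-p preprocess.denoise_weight` and `-p train` is easier to see on the nested dictionary that `params.yaml` parses into. The sketch below is only an illustration of dotted-key lookup, not how DVC resolves parameters internally; `resolve_param` is a hypothetical helper, and `PARAMS` copies the file shown above.

```python
from typing import Any, Dict

# Mirrors the params.yaml shown above, as the nested dict a YAML parser returns
PARAMS: Dict[str, Any] = {
    "preprocess": {"denoise_weight": 0.2},
    "train": {
        "image_width": 30,
        "image_height": 30,
        "batch_size": 32,
        "learning_rate": 0.1,
        "n_epochs": 5,
    },
}


def resolve_param(params: Dict[str, Any], dotted_key: str) -> Any:
    """Walk a nested dict following a dotted key such as 'train.batch_size'."""
    node: Any = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"parameter {dotted_key!r} not found")
        node = node[part]
    return node


print(resolve_param(PARAMS, "preprocess.denoise_weight"))  # 0.2
print(sorted(resolve_param(PARAMS, "train")))  # the five keys of the train group
```

A full dotted key points at one value, while a bare group name returns every parameter under it.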

The `train.py` script itself fits a CNN model with five layers, with max pooling, batch normalization and drop-out layers in-between. 

```python
def get_model():
    """Define the model to be fit"""
    # Define a CNN model
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Conv2D(
                filters=16,
                kernel_size=3,
                activation="relu",
                input_shape=(IMAGE_WIDTH, IMAGE_HEIGHT, 3),
            ),
            ...
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Dense(43, activation="softmax"),
        ]
    )

    # Compile the model
    model.compile(
        loss=tf.keras.losses.categorical_crossentropy,
        optimizer=tf.keras.optimizers.Adam(),
        metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
    )

    return model
```

Because of the tree structure of our project, the script uses `ImageDataGenerator`'s `flow_from_directory` method to feed the images with augmentation asynchronously to the model. 

```python
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.15,
    fill_mode="nearest",
)

train_generator = train_datagen.flow_from_directory(
    data_dir / "prepared" / "train",
    target_size=(IMAGE_WIDTH, IMAGE_HEIGHT),
    batch_size=params["batch_size"],
    class_mode="categorical",
)

```

In lines 11 and 12, we are loading the training stage parameters:

```python
params = params_show()["train"]
IMAGE_WIDTH, IMAGE_HEIGHT = params["image_width"], params["image_height"]
```
These are subsequently used to specify target image size of `ImageDataGenerator`, batch size, learning rate and the number of epochs.

Under `main()`, we are using the `ModelCheckpoint` callback to save the best model, and we save the `history` object as a CSV file to log the metrics.

```python
# Fit the model
history = model.fit(
    train_generator,
    steps_per_epoch=len(train_generator),
    epochs=params["n_epochs"],
    validation_data=test_generator,
    callbacks=callbacks,
)

# Save the metrics
Path("metrics").mkdir(exist_ok=True)
pd.DataFrame(history.history).to_csv("metrics/metrics.csv", index=False)
```
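It helps to know the shape of the resulting CSV, because the evaluation step reads it later. `history.history` is a dictionary that maps each metric name to a list with one value per epoch, so the file has one column per metric and one row per epoch. The sketch below reproduces that layout with the standard library; the function name `history_to_csv` and the numbers are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def history_to_csv(history: Dict[str, List[float]]) -> str:
    """Render a Keras-style history dict as CSV text, one row per epoch."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(history.keys())  # header: one column per metric
    writer.writerows(zip(*history.values()))  # transpose lists into epoch rows
    return buffer.getvalue()


# Made-up numbers for two epochs, for illustration only
fake_history = {"loss": [1.2, 0.8], "accuracy": [0.55, 0.71], "val_loss": [1.0, 0.9]}
print(history_to_csv(fake_history))
```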

Notice the uppercase `-O` tag on `metrics.csv`. The file is lightweight, so we want Git to track it rather than the DVC cache. `-O` (short for `--outs-no-cache`) registers the file as a stage output but tells DVC not to cache it. This behavior is reflected in `dvc.yaml` by the `cache: false` field:

```
$ cat dvc.yaml

stages:
  split:
    ...
  preprocess:
    ...
  train:
    cmd: python src/train.py
    deps:
    - data/prepared
    - src/train.py
    params:
    - train
    outs:
    - metrics/metrics.csv:
        cache: false
    - models
```

Don't forget to make a snapshot of new changes after running the pipeline with `dvc repro`:

```
$ git add .
$ git commit -m "Add and run 'train' stage of a pipeline"
```

### How to track metrics and plots in DVC?

We will add a final `evaluate` stage to our pipeline. A typical evaluation loop in an ML project tests the model from the latest experiment on validation data in terms of one or more metrics. This is also where you would plot model complexity curves, confusion matrices or any other plot that helps you debug your model and improve its performance.

Our `src/evaluate.py` script looks like below:

```python
import ...

# Extract the parameters
params = params_show()["train"]


def plot_metric(metrics_df: pd.DataFrame, metric_name: str, plot_path: str):
    """
    A function to plot both training and validation metrics from a 'metric_df'
    """
    # Plot metric_name and val_metric_name
    fig, ax = plt.subplots()

    epochs = np.arange(1, len(metrics_df) + 1)

    ax.plot(epochs, metrics_df[metric_name], "b", label=f"Training {metric_name}")
    ax.plot(
        epochs, metrics_df["val_" + metric_name], "bo", label=f"Validation {metric_name}"
    )

    plt.xlabel("Epoch")
    plt.title(f"Training and validation {metric_name}")
    plt.legend()
    plt.savefig(plot_path)


if __name__ == "__main__":
    # Create the path for plots
    Path("plots").mkdir(exist_ok=True)

    # Read the metrics and plot
    metrics = pd.read_csv("metrics/metrics.csv")
    metric_names = ["accuracy", "loss", "precision", "recall"]

    for current_metric in metric_names:
        plot_metric(metrics, current_metric, f"plots/{current_metric}.png")

    #########################################################################################
    # Usually, the below step would involve reporting the metric on a third, final test set #
    #########################################################################################

    # Save the best metrics as json
    sorted_metrics = metrics.sort_values("val_accuracy", ascending=False)
    metrics_dict = {
        "val_" + metric: sorted_metrics["val_" + metric].iloc[0] for metric in metric_names
    }

    with open("metrics/metrics.json", "w") as f:
        json.dump(metrics_dict, f)


```

The most important parts of the script are line 45, which plots four metrics (accuracy, loss, precision, and recall, each for training and validation) and saves them under a new `plots` directory, and lines 57-58, which save the best validation metrics of the training loop as key-value pairs to a JSON file. Here is the file afterwards:

```json
{
    "val_accuracy": 0.827551007270813,
    "val_loss": 0.5542715191841125,
    "val_precision": 0.9384615421295166,
    "val_recall": 0.7469387650489807
}
```
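The selection logic, "take the validation metrics from the epoch with the highest validation accuracy", can also be sketched without pandas. This is an illustrative alternative rather than the code in `src/evaluate.py`; `best_epoch_metrics` is a hypothetical name and the sample numbers are made up.

```python
import csv
import io
import json
from typing import Dict, List


def best_epoch_metrics(csv_text: str, sort_by: str = "val_accuracy") -> Dict[str, float]:
    """Return the validation metrics of the epoch with the highest `sort_by`."""
    rows: List[Dict[str, str]] = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("metrics file has no epochs")
    best = max(rows, key=lambda row: float(row[sort_by]))
    # Keep only the validation columns, converted from text to numbers
    return {name: float(value) for name, value in best.items() if name.startswith("val_")}


# Made-up numbers for three epochs, for illustration only
sample = """accuracy,val_accuracy,val_loss
0.50,0.60,1.10
0.70,0.80,0.70
0.75,0.78,0.72
"""
print(json.dumps(best_epoch_metrics(sample), indent=4))
```

Here the second epoch wins, so the output holds `val_accuracy` 0.8 and `val_loss` 0.7.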

In your own projects, always report metrics in this JSON format, as DVC recognizes and uses them to report metric differences between different runs of a pipeline. We will see how to do it in the next section. For now, let's add the `evaluate` stage to our pipeline:

```
$ dvc stage add -n evaluate \
                -d metrics/metrics.csv \
                -d src/evaluate.py \
                --plots plots \
                -M metrics/metrics.json \
                python src/evaluate.py
```

This time we are using two new tags: `-M` and `--plots`. While `-M` recognizes a specially-formatted metrics file like our `metrics.json`, `--plots` recognizes images as plots. These tags allow us to see the metrics and plots with `dvc metrics show` or `dvc plots show` commands. Again, we will see how to perform this in the next section.
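A flat JSON file of numbers is convenient because two runs can be compared key by key. As a rough sketch of that idea (not DVC's own code, with a hypothetical `diff_metrics` helper and made-up numbers), the comparison boils down to a subtraction per metric:

```python
from typing import Dict


def diff_metrics(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Change in every metric present in both runs (new minus old)."""
    return {name: new[name] - old[name] for name in old if name in new}


# Made-up numbers for two runs, for illustration only
baseline = {"val_accuracy": 0.75, "val_loss": 0.5}
candidate = {"val_accuracy": 0.875, "val_loss": 0.25}
for name, delta in diff_metrics(baseline, candidate).items():
    print(f"{name}: {delta:+.3f}")
```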

Here is how the stage looks in `dvc.yaml`:

```YAML
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - metrics/metrics.csv
    - src/evaluate.py
    metrics:
    - metrics/metrics.json:
        cache: false
    plots:
    - plots
```

Let's run `dvc repro` one final time and make a snapshot of the changes:

```
$ dvc repro
$ git add .
$ git commit -m "Add and run 'evaluate' stage of a pipeline"
```

Finally, our pipeline is ready to go! Now, it is time to run some experiments.
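To see how the four stages chain together, the sketch below derives an execution order from the dependencies and outputs given in the `dvc stage add` commands above. It only approximates the idea of a dependency graph and is not how DVC builds its own; `overlaps` and `execution_order` are hypothetical helpers, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9
from pathlib import PurePosixPath
from typing import Dict, List

# deps and outs copied from the four `dvc stage add` commands in this tutorial
STAGES: Dict[str, Dict[str, List[str]]] = {
    "evaluate": {
        "deps": ["metrics/metrics.csv", "src/evaluate.py"],
        "outs": ["plots", "metrics/metrics.json"],
    },
    "train": {
        "deps": ["data/prepared", "src/train.py"],
        "outs": ["models", "metrics/metrics.csv"],
    },
    "preprocess": {
        "deps": ["data/raw/", "src/preprocess.py"],
        "outs": ["data/prepared"],
    },
    "split": {
        "deps": ["src/split.py", "data/raw/train"],
        "outs": ["data/raw/test"],
    },
}


def overlaps(a: str, b: str) -> bool:
    """True if the two paths are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def execution_order(stages: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order stages so that each runs after the stages producing its inputs."""
    graph = {
        name: {
            other
            for other, spec in stages.items()
            if other != name
            and any(overlaps(d, o) for d in stage["deps"] for o in spec["outs"])
        }
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))  # ['split', 'preprocess', 'train', 'evaluate']
```

Even though the dictionary lists the stages in reverse, the order comes out as split, preprocess, train, evaluate, because each stage consumes an output of the one before it.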

### What is a machine learning experiment and how to run them in DVC?

### Introduction to DVC VSCode extension

### Metrics and plots in VSCode DVC extension