In this tutorial, we are going to build a text dataset for category classification. By the end,
we will have a basic picture of how to use DataCI to manage different versions of datasets and
the pipelines that generate them, and how to quickly adapt previous data scientists' work to new dataset versions.

This tutorial uses a simplified workflow from industry. Given a product title, we are going to determine the product
category.

# 0. Prerequisites

## Initialize DataCI

In [1]:
%cd ..

D:\Code\PycharmProjects\DataCI


In [2]:
!python dataci/command/init.py -f

Initialized DVC repository.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>



## Download Sample Raw Data

Assume we have sampled 20K raw records from an online product database and handed them over to annotators to
verify the product categories, which were filled in by sellers and contain some noise. The first batch of
10K labelled records has now been returned.

In [None]:
%%bash
# Copy the sample raw data to data/pairwise_raw/
mkdir -p data
# -f keeps the cell from failing when data/ is still empty
rm -rf data/*
cp -r dataset/multimodal_pairwise_v1 data/pairwise_raw/


This dataset contains train and val splits. Each split contains a CSV file with 3 columns:
`product_id`, `product_title` and `lv3_category`. We are going to build a pipeline to classify the product category
(`lv3_category`) based on its raw, dirty `product_title`.
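
Before publishing, it can help to take a quick look at a split. The following is a minimal sketch using only the standard library; the rows in `SAMPLE_CSV` are made up for illustration (they are not taken from the real dataset), and the helper names `load_rows` and `label_distribution` are hypothetical. Only the three column names come from the description above.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; the real splits live under data/pairwise_raw/.
SAMPLE_CSV = """product_id,product_title,lv3_category
1001,"  NEW!! wireless mouse 2.4G **free shipping** ",Mice
1002,"Men's running shoes -- size 42 !!!",Running Shoes
1003,"USB-C cable 1m fast charge",Cables
1004,"ergonomic wireless mouse (black)",Mice
"""


def load_rows(csv_file):
    """Read a split into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


def label_distribution(rows, label_column="lv3_category"):
    """Count how many rows fall under each category label."""
    return Counter(row[label_column] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
columns = list(rows[0].keys())
distribution = label_distribution(rows)

print(columns)
print(distribution.most_common())
```

To inspect a real split, pass an open file handle (for example `open('data/pairwise_raw/train.csv', newline='', encoding='utf-8')`) to `load_rows` instead of the in-memory sample.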

# 1. Build Text Classification Dataset

## 1.1 Publish raw data

Add both splits of this dataset to the data repository.

In [3]:
!python dataci/command/dataset.py publish -n pairwise_raw_train data/pairwise_raw/train.csv
!python dataci/command/dataset.py publish -n pairwise_raw_val data/pairwise_raw/val.csv

INFO:dataci.dataset.publish:Caching dataset files: data\pairwise_raw\train.csv
INFO:dataci.dataset.publish:Adding dataset to db: pairwise_raw_train@59e0e450440ba164f958621b15ed7e922cb96e80
INFO:dataci.dataset.publish:Caching dataset files: data\pairwise_raw\val.csv
INFO:dataci.dataset.publish:Adding dataset to db: pairwise_raw_val@e65eb67c9e493c9861ffdaf12e9d6f4af9dbe83b
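
The log lines above identify each published dataset as `name@version`, where the version is a long hexadecimal hash. Later listings in this tutorial show seven-character versions that look like prefixes of these hashes. The sketch below splits such an identifier into its parts. `DatasetRef` and `parse_dataset_ref` are hypothetical helpers written for this illustration and are not part of DataCI's API.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """A dataset name plus an optional version hash."""
    name: str
    version: Optional[str]

    @property
    def short_version(self) -> Optional[str]:
        # Assumption: the short form is the first 7 characters of the full hash.
        return self.version[:7] if self.version else None


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split 'name@version' on the last '@'; treat a missing or 'None' version as unset."""
    name, sep, version = ref.rpartition("@")
    if not sep:
        return DatasetRef(ref, None)
    return DatasetRef(name, None if version in ("", "None") else version)


train_ref = parse_dataset_ref(
    "pairwise_raw_train@59e0e450440ba164f958621b15ed7e922cb96e80"
)
print(train_ref.name, train_ref.short_version)
```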


## 1.2 Build a dataset for text classification

1. Build train dataset v1

In [4]:
import augly.text as txtaugs

from dataci.pipeline import Pipeline, stage


# Data processing: text augmentation
@stage(inputs='pairwise_raw_train', outputs='text_aug.csv')
def text_augmentation(inputs):
    transform = txtaugs.InsertPunctuationChars(
        granularity="all",
        cadence=5.0,
        vary_chars=True,
    )
    inputs['to_product_name'] = inputs['to_product_name'].map(transform)
    return inputs


# Define data pipeline
train_data_pipeline = Pipeline(name='train_data_pipeline', stages=[text_augmentation])
train_data_pipeline.build()

Run the train data pipeline:

In [5]:
train_data_pipeline()

<dataci.run.run.Run at 0x18fe9f55d60>

The output `text_aug.csv` will be used as the train dataset.
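
To get a feel for what this kind of augmentation does to a title, here is a simplified stand-in written in plain Python. It is not AugLy's implementation: `insert_punctuation`, its `seed` parameter, and the `PUNCTUATION` set are assumptions made for this sketch, and the real `InsertPunctuationChars` transform may choose characters and positions differently.

```python
import random

# Assumed character set for the sketch; AugLy uses its own.
PUNCTUATION = ".,!?;:"


def insert_punctuation(text, cadence=5, vary_chars=True, seed=0):
    """Insert one punctuation character after every `cadence` characters of `text`."""
    rng = random.Random(seed)
    fixed_char = rng.choice(PUNCTUATION)  # reused everywhere when vary_chars is False
    pieces = []
    for start in range(0, len(text), cadence):
        pieces.append(text[start:start + cadence])
        # Do not append punctuation after the final chunk.
        if start + cadence < len(text):
            pieces.append(rng.choice(PUNCTUATION) if vary_chars else fixed_char)
    return "".join(pieces)


title = "wireless mouse black"
noisy = insert_punctuation(title)
print(noisy)
```

The noisy title keeps every original character in order, so the label stays valid while the model sees a dirtier input.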

2. Run training with the built train and val dataset v1

Now you can simply fine-tune a pre-trained BERT model on this text classification dataset v1:

In [None]:
!python train.py --train-dataset ./train_data_pipeline:text_aug --val-dataset ./val_data_pipeline:text_clean

3. Save data pipeline

You can now publish your data pipeline for better management.

In [6]:
train_data_pipeline.publish()

In [7]:
train_data_pipeline()

INFO:dataci.run.save:Recover dvc file feat\text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat\text_aug.csv
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@15700c18bc4373af8ee4d282ad85ac1095be7924
INFO:dataci.pipeline.pipeline:pairwise_raw_train@59e0e450440ba164f958621b15ed7e922cb96e80 >>> train_data_pipeline@dd8a7ce.run1 >>> train_data_pipeline:text_aug@None


<dataci.run.run.Run at 0x18fe9ffac10>

# 2. Try with New Data Augmentation Method

Let's create a second version of `train_data_pipeline:text_aug` for text classification, using a
different data augmentation method to improve model performance.

## 2.1 Write a second version of the train data pipeline

We design a better data augmentation method for `train_data_pipeline_v2`:

In [8]:
@stage(inputs='pairwise_raw_train', outputs='text_aug.csv')
def text_augmentation(inputs):
    transform = txtaugs.Compose(
        [
            txtaugs.InsertWhitespaceChars(p=0.5),
            txtaugs.InsertPunctuationChars(
                granularity="all",
                cadence=5.0,
                vary_chars=True,
            )
        ]
    )
    inputs['to_product_name'] = inputs['to_product_name'].map(transform)
    return inputs

train_data_pipeline_v2 = Pipeline(name='train_data_pipeline', stages=[text_augmentation])

## 2.2 Test the pipeline v2 and publish

In [9]:
train_data_pipeline_v2()
train_data_pipeline_v2.publish()

## 2.3 Let's check our pipeline `train_data_pipeline`

In [10]:
!python dataci/command/pipeline.py ls

train_data_pipeline
|  Version	Create time
|- dd8a7ce	2023-03-21 00:24:06
|    |- run1
|- ac84d6b	2023-03-21 00:24:38


## 2.4 Publish text classification dataset v2

It is easy to update the output dataset once our data pipeline has a new version:

In [11]:
train_data_pipeline_v2()

INFO:dataci.run.save:Recover dvc file feat\text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat\text_aug.csv
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@8eda1ec110dbbb07597d83c4fdb32b831c415792
INFO:dataci.pipeline.pipeline:pairwise_raw_train@59e0e450440ba164f958621b15ed7e922cb96e80 >>> train_data_pipeline@ac84d6b.run1 >>> train_data_pipeline:text_aug@None


<dataci.run.run.Run at 0x18fea0086a0>

# 3. Try with more raw data

Our human annotators have finished labelling the second batch of 10K records. We publish the two batches of
labelled raw data, combined, as v2:

In [None]:
%%bash
# Download pairwise_raw_v2
cp -r dataset/multimodal_pairwise_v2 data/pairwise_raw_v2/

Publish raw data v2:

In [12]:
!python dataci/command/dataset.py publish -n pairwise_raw_train data/pairwise_raw_v2/train.csv

INFO:dataci.dataset.publish:Caching dataset files: data\pairwise_raw_v2\train.csv
INFO:dataci.dataset.publish:Adding dataset to db: pairwise_raw_train@134854fb9ad0090132e587e6a19e3f1f154319fa


We can easily update our text classification dataset:

In [13]:
!python dataci/command/dataset.py update -n train_data_pipeline:text_aug

Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.


INFO:dataci.dataset.update:Searching changes...
INFO:dataci.dataset.update:Found 2 possible updates:
INFO:dataci.dataset.update:| S.N. | Parent dataset                 >>> Pipeline                       |
INFO:dataci.dataset.update:|    1 | pairwise_raw_train@134854f     >>> train_data_pipeline@ac84d6b    |
INFO:dataci.dataset.update:|    2 | pairwise_raw_train@134854f     >>> train_data_pipeline@dd8a7ce    |
INFO:dataci.dataset.update:Total 1 dataset version(s), 2 pipeline versions(s).
INFO:dataci.dataset.update:Executing dataset update...
INFO:dataci.run.save:Recover dvc file feat\text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat\text_aug.csv
ERROR: output 'feat\text_aug.csv' is already specified in stage: 'text_augmentation'.
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@9ac38e5de7df49690cdd39e2726db3b9cd30f7
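
The "Found 2 possible updates" table above is consistent with a simple search: pair every parent dataset version with every pipeline version, then keep the pairs that have not produced an output yet. The sketch below illustrates that idea with the short version hashes from this tutorial. It is a guess at the logic, not DataCI's actual implementation, and `find_possible_updates` is a hypothetical name.

```python
from itertools import product


def find_possible_updates(parent_versions, pipeline_versions, existing):
    """Return (parent, pipeline) pairs that have not produced an output dataset yet."""
    return [
        (parent, pipeline)
        for parent, pipeline in product(parent_versions, pipeline_versions)
        if (parent, pipeline) not in existing
    ]


# Short hashes as they appear in this tutorial's logs.
parent_versions = ["59e0e45", "134854f"]   # pairwise_raw_train v1 and v2
pipeline_versions = ["ac84d6b", "dd8a7ce"]  # train_data_pipeline versions
# Combinations that were already run before raw data v2 was published.
existing = {("59e0e45", "dd8a7ce"), ("59e0e45", "ac84d6b")}

updates = find_possible_updates(parent_versions, pipeline_versions, existing)
for parent, pipeline in updates:
    print(f"pairwise_raw_train@{parent} >>> train_data_pipeline@{pipeline}")
```

Under these assumptions, only the two pairs involving the new raw data version remain, which matches the one new dataset version and two pipeline versions reported in the log.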

# 4. Summary

That was a long journey! Wait, how many datasets do we have, and how do they perform?
Things get quite messy after we publish many datasets and pipelines and run a lot of workflows
and benchmarks.
Luckily, as we develop our data pipelines, DataCI helps manage and audit all of them!

## 4.1 How many datasets are there, and how are they related?

1. Check all registered datasets

In [14]:
!python dataci/command/dataset.py ls

pairwise_raw_train
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 59e0e45	N.A.		None@None		N.A.	2023-03-21 00:24:01
|- 134854f	N.A.		None@None		N.A.	2023-03-21 00:25:08
pairwise_raw_val
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- e65eb67	N.A.		None@None		N.A.	2023-03-21 00:24:03
train_data_pipeline:text_aug
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 15700c1	train_data_pipeline@dd8a7ce		pairwise_raw_train@59e0e450440ba164f958621b15ed7e922cb96e80		N.A.	2023-03-21 00:24:30
|- 8eda1ec	train_data_pipeline@ac84d6b		pairwise_raw_train@59e0e450440ba164f958621b15ed7e922cb96e80		N.A.	2023-03-21 00:24:57
|- 9ac38e5	train_data_pipeline@ac84d6b		pairwise_raw_train@134854fb9ad0090132e587e6a19e3f1f154319fa		N.A.	2023-03-21 00:25:56
|- 3db3700	train_data_pipeline@dd8a7ce		pairwise_raw_train@134854fb9ad0090132e587e6a19e3f1f154319fa		N.A.	2023-03-21 00:26:01
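
The listing above already contains enough information to reconstruct the lineage of each output dataset. As a small illustration, the sketch below groups the four `train_data_pipeline:text_aug` versions by their parent dataset. The records are copied from the listing with the parent hashes shortened to seven characters, and `group_by_parent` is a hypothetical helper, not a DataCI function.

```python
from collections import defaultdict

# (version, yield pipeline, parent dataset), copied from the listing with shortened hashes.
TEXT_AUG_VERSIONS = [
    ("15700c1", "train_data_pipeline@dd8a7ce", "pairwise_raw_train@59e0e45"),
    ("8eda1ec", "train_data_pipeline@ac84d6b", "pairwise_raw_train@59e0e45"),
    ("9ac38e5", "train_data_pipeline@ac84d6b", "pairwise_raw_train@134854f"),
    ("3db3700", "train_data_pipeline@dd8a7ce", "pairwise_raw_train@134854f"),
]


def group_by_parent(records):
    """Map each parent dataset to the (pipeline, output version) pairs derived from it."""
    children = defaultdict(list)
    for version, pipeline, parent in records:
        children[parent].append((pipeline, version))
    return dict(children)


lineage = group_by_parent(TEXT_AUG_VERSIONS)
for parent, outputs in lineage.items():
    print(parent)
    for pipeline, version in outputs:
        print(f"  {pipeline} -> train_data_pipeline:text_aug@{version}")
```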


2. Compare specific dataset versions:

In [None]:
# TODO

## 4.2 How many pipelines are built?

In [15]:
!python dataci/command/pipeline.py ls

train_data_pipeline
|  Version	Create time
|- dd8a7ce	2023-03-21 00:24:06
|    |- run1
|    |- run2
|- ac84d6b	2023-03-21 00:24:38
|    |- run1
|    |- run2
