In this tutorial, we build a text dataset for category classification. By the end,
we will have a basic picture of how to use DataCI to manage different versions of datasets
and the pipelines that generate them, and how to quickly adapt previous data scientists' work to new dataset versions.

This tutorial uses a simplified workflow from industry. Given a product title, we are going to determine the product
category.

# 0. Prerequisites

## Initialize DataCI

In [1]:
%cd ../../

D:\Code\PycharmProjects\DataCI


In [2]:
!python dataci/command/init.py -f

Initialized DVC repository.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>



## Download Sample Raw Data

Assume we have sampled 20K raw records from an online product database and handed them to annotators to
verify the product categories, which were filled in by sellers and contain some noise. The first batch of
10K labelled records has now been returned.

In [None]:
%%bash
# Reset data/ and copy the first batch to data/text_raw/
mkdir -p data
rm -rf data/*
cp -r dataset/text_cls_v1 data/text_raw/


This dataset contains train and val splits. Each split contains a CSV file with 3 columns:
`id`, `product_name` and `category_lv3`. We are going to build a pipeline to classify the product category
(`category_lv3`) based on its dirty `product_name`.
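
Before publishing, it can help to sanity-check a split. The snippet below is a minimal sketch, not part of DataCI: it checks that the three columns are present and counts the labels. The sample rows are hypothetical and stand in for the real files under `data/text_raw/`.

```python
import csv
import io
from collections import Counter

# Hypothetical rows for illustration only; real data lives in data/text_raw/.
SAMPLE_CSV = """id,product_name,category_lv3
1,Wireless Mouse 2.4GHz!!,Mice
2,usb-c cable 1m,Cables
3,Gaming mouse RGB,Mice
"""

EXPECTED_COLUMNS = {"id", "product_name", "category_lv3"}


def summarize_split(csv_file):
    """Return (row_count, Counter of category_lv3 labels) for one split."""
    reader = csv.DictReader(csv_file)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = Counter(row["category_lv3"] for row in reader)
    return sum(labels.values()), labels


n_rows, label_counts = summarize_split(io.StringIO(SAMPLE_CSV))
print(n_rows, label_counts.most_common())
```

To run it on a real split, pass an open file handle instead of the `io.StringIO` wrapper.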

# 1. Build Text Classification Dataset

## 1.1 Publish raw data

Add both splits of this dataset to the data repository.

In [3]:
!python dataci/command/dataset.py publish -n text_raw_train data/text_raw/train.csv
!python dataci/command/dataset.py publish -n text_raw_val data/text_raw/val.csv

INFO:dataci.dataset.publish:Caching dataset files: data\text_raw\train.csv

INFO:dataci.dataset.publish:Adding dataset to db: text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db
INFO:dataci.dataset.publish:Caching dataset files: data\text_raw\val.csv

INFO:dataci.dataset.publish:Adding dataset to db: text_raw_val@5a1615201e7e7542ff33647efa87ffc0e415936c
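
The log lines above identify each published dataset as `name@version`, where the version is a 40-character hash. The helper below is a small sketch for splitting such an identifier and deriving a 7-character short version. `DatasetRef` and `parse_dataset_ref` are hypothetical names for illustration, not DataCI APIs.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """A dataset identifier split into name and version (hypothetical helper)."""
    name: str
    version: Optional[str]

    @property
    def short_version(self) -> Optional[str]:
        # Assumption: a short version is the first 7 characters of the hash.
        return self.version[:7] if self.version else None


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Parse 'name@version'; a missing or literal 'None' version becomes None."""
    name, sep, version = ref.rpartition("@")
    if not sep:
        return DatasetRef(ref, None)
    return DatasetRef(name, None if version == "None" else version)


ref = parse_dataset_ref("text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db")
print(ref.name, ref.short_version)
```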


## 1.2 Build a dataset for text classification

1. Build train dataset v1

In [4]:
import augly.text as txtaugs

from dataci.pipeline import Pipeline, stage


# Data processing: text augmentation
@stage(inputs='text_raw_train', outputs='text_aug.csv')
def text_augmentation(inputs):
    transform = txtaugs.InsertPunctuationChars(
        granularity="all",
        cadence=5.0,
        vary_chars=True,
    )
    inputs['product_name'] = inputs['product_name'].map(transform)
    return inputs


# Define data pipeline
train_data_pipeline = Pipeline(name='train_data_pipeline', stages=[text_augmentation])
train_data_pipeline.build()

Run the train data pipeline:

In [5]:
train_data_pipeline()

<dataci.run.run.Run at 0x1fe7f756f70>

The output `text_aug.csv` will be used as the training dataset.
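
To build intuition for what this augmentation does to a product name, here is a simplified pure-Python stand-in. It is not AugLy's implementation: it assumes an integer cadence and a single fixed punctuation character, whereas `InsertPunctuationChars` with `vary_chars=True` varies the inserted character.

```python
def insert_punctuation(text: str, cadence: int = 5, char: str = ".") -> str:
    """Insert `char` between consecutive chunks of `cadence` characters."""
    if cadence < 1:
        raise ValueError("cadence must be >= 1")
    chunks = [text[i:i + cadence] for i in range(0, len(text), cadence)]
    return char.join(chunks)


print(insert_punctuation("wireless mouse", cadence=5))
```

The augmented title stays readable to a human but no longer matches the original token sequence, which is the kind of noise the classifier should learn to tolerate.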

2. Run training with the train and val datasets v1

Now you can simply train a pre-trained BERT model on this text classification dataset v1:

In [None]:
!python example/create_text_classification_dataset/train.py \
  --train_dataset=train_data_pipeline/latest/runs/1/feat/text_aug.csv \
  --test_dataset=../data/text_raw/val.csv \
  -b4 \
  --max_train_steps_per_epoch=20 \
  --max_val_steps_per_epoch=20

For demonstration purposes, we train and validate for only a few steps to obtain the results.

3. Save data pipeline

You can now publish your data pipeline for better management.

In [6]:
train_data_pipeline.publish()

4. Publish first version of text dataset

Run the published pipeline `train_data_pipeline`; its final output `text_aug.csv` is
automatically published as the dataset `train_data_pipeline:text_aug`.

In [7]:
train_data_pipeline()

INFO:dataci.run.save:Recover dvc file feat\text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat\text_aug.csv
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@1c08c58212cbc0f87f93cf5e8a8a21f28f5547f4
INFO:dataci.pipeline.pipeline:text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db >>> train_data_pipeline@50addf3.run1 >>> train_data_pipeline:text_aug@None


<dataci.run.run.Run at 0x1fe7f736970>

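The last log line of the run records the lineage in the form `parent dataset >>> pipeline.run >>> output dataset`. As a sketch for inspecting such lines, the hypothetical helper below (not a DataCI API) splits one into its parts; it assumes the three fields are always separated by `>>>`.

```python
def parse_lineage(line: str) -> dict:
    """Split 'parent >>> pipeline.run >>> output' into its parts."""
    parent, pipeline_run, output = (part.strip() for part in line.split(">>>"))
    # The run label follows the last dot, e.g. 'train_data_pipeline@50addf3.run1'.
    pipeline, _, run = pipeline_run.rpartition(".")
    return {"parent": parent, "pipeline": pipeline, "run": run, "output": output}


lineage = parse_lineage(
    "text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db"
    " >>> train_data_pipeline@50addf3.run1"
    " >>> train_data_pipeline:text_aug@None"
)
print(lineage)
```
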
# 2. Try with New Data Augmentation Method

Let's create a second version of `train_data_pipeline:text_aug` using a
different data augmentation method to improve model performance.

## 2.1 Write a second version train data pipeline

We design a new data augmentation method for `train_data_pipeline_v2` that inserts whitespace before inserting punctuation:

In [8]:
@stage(inputs='text_raw_train', outputs='text_aug.csv')
def text_augmentation(inputs):
    transform = txtaugs.Compose(
        [
            txtaugs.InsertWhitespaceChars(p=0.5),
            txtaugs.InsertPunctuationChars(
                granularity="all",
                cadence=5.0,
                vary_chars=True,
            )
        ]
    )
    inputs['product_name'] = inputs['product_name'].map(transform)
    return inputs

train_data_pipeline_v2 = Pipeline(name='train_data_pipeline', stages=[text_augmentation])
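
The composed transform applies its steps in order, and a step with `p=0.5` is skipped about half of the time. The sketch below mimics that behaviour in plain Python under those assumptions. `compose`, `add_spaces` and `add_dots` are hypothetical stand-ins, not AugLy functions.

```python
import random
from typing import Callable, List, Tuple

Transform = Callable[[str], str]


def compose(steps: List[Tuple[Transform, float]], seed: int = 0) -> Transform:
    """Chain transforms; each step runs with its own probability p."""
    rng = random.Random(seed)

    def apply(text: str) -> str:
        for transform, p in steps:
            # rng.random() is in [0, 1): p=1.0 always runs, p=0.0 never runs.
            if rng.random() < p:
                text = transform(text)
        return text

    return apply


def add_spaces(text: str) -> str:
    """Stand-in for whitespace insertion: a space between every character."""
    return " ".join(text)


def add_dots(text: str) -> str:
    """Stand-in for punctuation insertion: a dot between 5-character chunks."""
    return ".".join(text[i:i + 5] for i in range(0, len(text), 5))


always = compose([(add_spaces, 1.0), (add_dots, 1.0)])
never_spaces = compose([(add_spaces, 0.0), (add_dots, 1.0)])
print(always("mouse"))
print(never_spaces("wireless mouse"))
```

Because the second step sees the output of the first, the order of the steps changes the result.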

## 2.2 Test the pipeline v2 and publish

In [9]:
train_data_pipeline_v2()
train_data_pipeline_v2.publish()

## 2.3 Let's check our pipeline `train_data_pipeline`

In [10]:
!python dataci/command/pipeline.py ls

train_data_pipeline
|  Version	Create time
|- 50addf3	2023-03-21 23:07:43
|    |- run1
|- fce2553	2023-03-21 23:08:20


## 2.4 Publish text classification dataset v2

It is easy to update the output dataset once our data pipeline has a new version:

In [11]:
train_data_pipeline_v2()

INFO:dataci.run.save:Recover dvc file feat\text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat\text_aug.csv
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@9087a73a1540193c5f1172f9491bbd9743cb8401
INFO:dataci.pipeline.pipeline:text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db >>> train_data_pipeline@fce2553.run1 >>> train_data_pipeline:text_aug@None


<dataci.run.run.Run at 0x1fe7f7369a0>

# 3. Try with More Raw Data

Our human annotators have finished labelling the second batch of 10K records. We publish the two batches of
labelled raw data, combined, as v2:

In [None]:
%%bash
# Download text_raw_v2
cp -rf dataset/text_cls_v2 data/text_raw_v2/

Publish raw data v2:

In [12]:
!python dataci/command/dataset.py publish -n text_raw_train data/text_raw_v2/train.csv

INFO:dataci.dataset.publish:Caching dataset files: data\text_raw_v2\train.csv

INFO:dataci.dataset.publish:Adding dataset to db: text_raw_train@03f7a71b52b0441eac6675c408b120358bdd9519


We can easily update our text classification dataset:

In [13]:
!python dataci/command/dataset.py update -n train_data_pipeline:text_aug

Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.


INFO:dataci.dataset.update:Searching changes...
INFO:dataci.dataset.update:Found 2 possible updates:
INFO:dataci.dataset.update:| S.N. | Parent dataset                 >>> Pipeline                       |
INFO:dataci.dataset.update:|    1 | text_raw_train@03f7a71         >>> train_data_pipeline@50addf3    |
INFO:dataci.dataset.update:|    2 | text_raw_train@03f7a71         >>> train_data_pipeline@fce2553    |
INFO:dataci.dataset.update:Total 1 dataset version(s), 2 pipeline versions(s).
INFO:dataci.dataset.update:Executing dataset update...
INFO:dataci.run.save:Recover dvc file feat\text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat\text_aug.csv

ERROR: output 'feat\text_aug.csv' is already specified in stage: 'text_augmentation'.
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@900fa953496c242f40d375ff8427180eeea9fa9e
INFO:dataci.pipeline.pipeline:text_raw_tr
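
The log above reports 1 new dataset version and 2 pipeline versions, giving 2 possible updates. One way to read this is that the update pairs every parent dataset version with every pipeline version and keeps the pairs that have no output yet. The sketch below illustrates that reading. It is an assumption inferred from the log, not DataCI's actual implementation, and `plan_updates` is a hypothetical name.

```python
from itertools import product


def plan_updates(parent_versions, pipeline_versions, already_built):
    """Return (parent, pipeline) pairs that have no output dataset yet."""
    return [
        pair
        for pair in product(parent_versions, pipeline_versions)
        if pair not in already_built
    ]


# Pairs already run earlier in this tutorial (raw data v1 with both pipelines).
already_built = {
    ("text_raw_train@63b4a73", "train_data_pipeline@50addf3"),
    ("text_raw_train@63b4a73", "train_data_pipeline@fce2553"),
}

updates = plan_updates(
    ["text_raw_train@63b4a73", "text_raw_train@03f7a71"],
    ["train_data_pipeline@50addf3", "train_data_pipeline@fce2553"],
    already_built,
)
for parent, pipeline in updates:
    print(parent, ">>>", pipeline)
```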

# 4. Summary

That was a long journey! But wait: how many datasets do we have, and how do they perform?
Things get messy after we publish many datasets and pipelines and run many workflows.
Luckily, DataCI helps manage and audit all of them while we develop our data pipelines!

## 4.1 How many datasets are built?

In [14]:
!python dataci/command/dataset.py ls

text_raw_train
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 63b4a73	N.A.		None@None		N.A.	2023-03-21 23:07:35
|- 03f7a71	N.A.		None@None		N.A.	2023-03-21 23:08:52
text_raw_val
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 5a16152	N.A.		None@None		N.A.	2023-03-21 23:07:37
train_data_pipeline:text_aug
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 1c08c58	train_data_pipeline@50addf3		text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db		N.A.	2023-03-21 23:08:04
|- 9087a73	train_data_pipeline@fce2553		text_raw_train@63b4a736f55762612a9bd1c9d553e21eedc2d1db		N.A.	2023-03-21 23:08:46
|- 900fa95	train_data_pipeline@50addf3		text_raw_train@03f7a71b52b0441eac6675c408b120358bdd9519		N.A.	2023-03-21 23:09:22
|- 52ee6c8	train_data_pipeline@fce2553		text_raw_train@03f7a71b52b0441eac6675c408b120358bdd9519		N.A.	2023-03-21 23:09:27


## 4.2 Compare different dataset versions

In [None]:
# TODO
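
Until this cell is filled in, one simple way to compare two versions of a CSV dataset is to diff them by `id`. The snippet below is only a sketch of that idea, not a DataCI feature. The rows are hypothetical and `compare_versions` is an illustrative name.

```python
import csv
import io

# Hypothetical rows for illustration only.
V1 = """id,product_name,category_lv3
1,Wireless Mouse,Mice
2,USB-C Cable,Cables
"""
V2 = """id,product_name,category_lv3
1,Wire.less Mouse,Mice
2,USB-C Cable,Cables
3,Gaming Mouse,Mice
"""


def load_by_id(csv_text):
    """Index the rows of a CSV string by their `id` column."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def compare_versions(old_csv, new_csv):
    """Report which ids were added, removed, or changed between two versions."""
    old, new = load_by_id(old_csv), load_by_id(new_csv)
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(i for i in shared if old[i] != new[i]),
    }


diff = compare_versions(V1, V2)
print(diff)
```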

## 4.3 How many pipelines are built?

In [15]:
!python dataci/command/pipeline.py ls

train_data_pipeline
|  Version	Create time
|- 50addf3	2023-03-21 23:07:43
|    |- run1
|    |- run2
|- fce2553	2023-03-21 23:08:20
|    |- run1
|    |- run2
