In this tutorial, we are going to build a text dataset for category classification. After this tutorial,
we will have a basic picture of how to use DataCI to manage different versions of datasets and
their data-generating pipelines, and how to quickly adapt data scientists' earlier work to new versions of datasets.

This tutorial uses a simplified workflow from industry. Given a product title, we are going to determine the product
category.

# 0. Prerequisites

## Initialize DataCI

In [1]:
%pip install scikit-learn
%pip install transformers
print(
    'You should also install pytorch, check https://pytorch.org/get-started/locally/ to find specific version '
    'matches your OS, package and platform'
)

Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.
You should also install pytorch, check https://pytorch.org/get-started/locally/ to find specific version matches your OS, package and platform


In [2]:
%cd ../../

import os

os.environ['PYTHONPATH'] = os.getcwd()

/root/workspace/DataCI


In [3]:
!python dataci/command/init.py -f

Initialized DVC repository.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


## Download Sample Raw Data

Assume we have sampled 20K raw records from an online product database and handed them to annotators to
verify their product categories, which were filled in by sellers and contain some noise. Now the first batch of
10K labelled records has been returned.

In [4]:
%%bash
# saved at data/text_raw/
mkdir -p data
# -f avoids an error when data/ is still empty
rm -rf data/*
cp -r dataset/text_cls_v1 data/text_raw/


This dataset contains train and val splits. Each split contains a CSV file with 3 columns:
`id`, `product_name` and `category_lv3`. We are going to build a pipeline to classify the product category
(`category_lv3`) based on its dirty `product_name`.
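
Before publishing, it can help to confirm that a split really has the columns described above. The following is a minimal sketch using only the standard library; `check_columns` and the sample rows are hypothetical and not part of DataCI, and the column names are taken from the description above.

```python
import csv
import io

# Column names as described in this tutorial
EXPECTED_COLUMNS = ["id", "product_name", "category_lv3"]


def check_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Raise ValueError if an expected column is missing; return the row count."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in expected if col not in header]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return sum(1 for _ in reader)


# Hypothetical in-memory sample standing in for a real split
sample = io.StringIO(
    "id,product_name,category_lv3\n"
    "1,Example product A,Category X\n"
    "2,Example product B,Category Y\n"
)
print(check_columns(sample))
```

For a real split, the same function could be called on an open file handle, for example one opened on `data/text_raw/train.csv` with `newline=''`.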

# 1. Build Text Classification Dataset

## 1.1 Publish raw data

Publish the two splits of this dataset to the data repository.

In [5]:
!python dataci/command/dataset.py publish -n text_raw_train data/text_raw/train.csv
!python dataci/command/dataset.py publish -n text_raw_val data/text_raw/val.csv

INFO:dataci.dataset.publish:Caching dataset files: data/text_raw/train.csv
INFO:dataci.dataset.publish:Adding dataset to db: text_raw_train@f3a821d8a22533c642f16f115d5d2b3f02569f0d
INFO:dataci.dataset.publish:Caching dataset files: data/text_raw/val.csv

## 1.2 Build a dataset for text classification

1. Build train dataset v1

In [6]:
import augly.text as txtaugs

from dataci.pipeline import Pipeline, stage


# Data processing: text augmentation
@stage(inputs='text_raw_train', outputs='text_aug.csv')
def text_augmentation(inputs):
    transform = txtaugs.InsertPunctuationChars(
        granularity="all",
        cadence=5.0,
        vary_chars=True,
    )
    inputs['product_name'] = inputs['product_name'].map(transform)
    return inputs


# Define data pipeline
train_data_pipeline = Pipeline(name='train_data_pipeline', stages=[text_augmentation])
train_data_pipeline.build()



A       ../../.dataci/tmp/text_raw_train/f3a821d8a22533c642f16f115d5d2b3f02569f0d/train.csv
Modifying stage 'text_augmentation' in 'dvc.yaml'
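
To get a feel for what the augmentation stage does to a product name, here is a simplified stand-in written in plain Python. It is not AugLy's implementation: `insert_punctuation` and the `PUNCTUATION` character set are assumptions made for illustration, and the sketch only mimics the idea of inserting a varying punctuation character after every `cadence` characters.

```python
import random

# Hypothetical character set for this sketch; AugLy's own set may differ
PUNCTUATION = ".,;:!?'-"


def insert_punctuation(text, cadence=5, rng=None):
    """Insert a random punctuation character after every `cadence` characters."""
    rng = rng or random.Random()
    # Split the text into fixed-size chunks of `cadence` characters
    chunks = [text[i:i + cadence] for i in range(0, len(text), cadence)]
    pieces = []
    for index, chunk in enumerate(chunks):
        pieces.append(chunk)
        # No punctuation after the final chunk
        if index < len(chunks) - 1:
            pieces.append(rng.choice(PUNCTUATION))
    return "".join(pieces)


print(insert_punctuation("Ziplock Seal Transparent Storage Bag", rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the output reproducible, which is convenient when comparing dataset versions.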


Run the train data pipeline:

In [7]:
train_data_pipeline()

Running stage 'text_augmentation':
> python code/text_augmentation.py


INFO:dataci.pipeline.stage:Load input /root/workspace/DataCI/.dataci/tmp/text_raw_train/f3a821d8a22533c642f16f115d5d2b3f02569f0d/train.csv as pandas Dataframe
INFO:dataci.pipeline.stage:Save output to feat/text_aug.csv


Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.


<dataci.run.run.Run at 0x7f2b544b8d60>

The output `text_aug.csv` will be used as the training dataset.

2. Run training with the built train and val dataset v1

Now you can simply fine-tune a pre-trained BERT on this text classification dataset v1:

In [8]:
!python example/create_text_classification_dataset/train.py \
  --train_dataset=train_data_pipeline/latest/runs/1/feat/text_aug.csv \
  --test_dataset=data/text_raw/val.csv \
  -b4 \
  --max_train_steps_per_epoch=20 \
  --max_val_steps_per_epoch=20

           id                                       product_name category_lv0
0  2197727145  ["Ziplo;ck Se:al Tr;anspa?rent -Stora'ge Ba:g ...         FMCG
1  6946156365  ["Lab O,n Hai.r Ant'i Hai?r Fal.l Sha'mpoo ;30...         FMCG
2  3865988017  ["กระปุ-กออมส!ิน AT!M กระ;ปุกออ;มสิน ,มีดนต?รี...         FMCG
3  1585616576  ["Siêu ?Sim D!ata 4:G Trọ?n Gói! 1 Nă!m Khô;ng...           EL
4   733610874  ["Quần 'Jean ...Nữ Ốn...g Loe: Lưng, Cao !Aaa ...      Fashion
           id                                       product_name category_lv0
0  3373941853    Adorn by Calmskin Blueberry Whipped Scrub 250ml         FMCG
1  5948198349  Kacamata Hitam Korean Fashion Wanita/Pria Sung...      Fashion
2    11608450  de Nature - Kapsul  Ziirzax dan Typhogell - Ob...         FMCG
3  2107761836  Samsung Galaxy A73 5G | 8GB+128GB | 8GB+256GB ...           EL
4  3251354065  NEW LABEL SKEENCARE PEELING LOTION TRIO 100ML ...         FMCG
Train split size: 16204, Val split size: 853
Test dataset size: 

For demonstration purposes, we train and validate on the dataset for only a few steps to obtain the results.
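
One common way to cap the number of steps per epoch is to truncate the batch iterator. The sketch below shows that idea only; `limited_batches` is a hypothetical helper, and nothing here is taken from the actual `train.py`.

```python
from itertools import islice


def limited_batches(batches, max_steps=None):
    """Yield at most `max_steps` batches; yield all of them when max_steps is None."""
    return islice(batches, max_steps)


# A stand-in for a data loader: 100 dummy batches, capped at 20 steps
steps_run = sum(1 for _ in limited_batches(range(100), max_steps=20))
print(steps_run)
```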

3. Save data pipeline

You can now publish your data pipeline for better management.

In [9]:
train_data_pipeline.publish()

4. Publish first version of text dataset

Run the published pipeline `train_data_pipeline`. Its final output, `text_aug.csv`, will be
automatically published as a dataset: `train_data_pipeline:text_aug`.

In [10]:
train_data_pipeline()

INFO:dataci.run.save:Recover dvc file feat/text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat/text_aug.csv


Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.


ERROR: output 'feat/text_aug.csv' is already specified in stage: 'text_augmentation'.
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@db07fae33d424a11445a96aafc182175abe255d7
INFO:dataci.pipeline.pipeline:text_raw_train@f3a821d8a22533c642f16f115d5d2b3f02569f0d >>> train_data_pipeline@bcca51d.run1 >>> train_data_pipeline:text_aug@None


<dataci.run.run.Run at 0x7f2b54492a00>
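
The last log line above records the lineage as `parent dataset >>> pipeline run >>> output dataset`, with each item written as `name@version`. The following sketch parses such a line; `Ref`, `parse_ref`, `parse_lineage` and `short_version` are hypothetical helpers, not DataCI functions, and mapping the literal `None` to Python's `None` is a choice made for this sketch.

```python
from typing import List, NamedTuple, Optional


class Ref(NamedTuple):
    name: str
    version: Optional[str]


def parse_ref(ref: str) -> Ref:
    """Split 'name@version' on the last '@'; a literal 'None' becomes None."""
    name, sep, version = ref.rpartition("@")
    if not sep:
        # No '@' present: the whole string is the name
        return Ref(ref, None)
    return Ref(name, None if version == "None" else version)


def parse_lineage(line: str) -> List[Ref]:
    """Parse 'a@v1 >>> b@v2 >>> c@v3' into a list of Ref tuples."""
    return [parse_ref(part.strip()) for part in line.split(">>>")]


def short_version(ref: Ref, length: int = 7) -> Optional[str]:
    """Abbreviate a version to its first `length` characters."""
    return ref.version[:length] if ref.version else None


line = (
    "text_raw_train@f3a821d8a22533c642f16f115d5d2b3f02569f0d >>> "
    "train_data_pipeline@bcca51d.run1 >>> train_data_pipeline:text_aug@None"
)
for ref in parse_lineage(line):
    print(ref.name, short_version(ref))
```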

# 2. Try with New Data Augmentation Method

Let's create a second version of `train_data_pipeline:text_aug` for text classification, using a
different data augmentation method to improve model performance.

## 2.1 Write a second version train data pipeline

We design a better data augmentation method for `train_data_pipeline_v2`:

In [11]:
@stage(inputs='text_raw_train', outputs='text_aug.csv')
def text_augmentation(inputs):
    transform = txtaugs.Compose(
        [
            txtaugs.InsertWhitespaceChars(p=0.5),
            txtaugs.InsertPunctuationChars(
                granularity="all",
                cadence=5.0,
                vary_chars=True,
            )
        ]
    )
    inputs['product_name'] = inputs['product_name'].map(transform)
    return inputs

train_data_pipeline_v2 = Pipeline(name='train_data_pipeline', stages=[text_augmentation])
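
The v2 stage chains two transforms, and the first one is only applied with probability `p=0.5`. The sketch below illustrates that composition pattern in plain Python. It is a simplified illustration under stated assumptions, not AugLy's `Compose`: `compose`, `insert_every`, `add_space` and `add_punct` are hypothetical names, and the cadence of 3 for whitespace is arbitrary.

```python
import random
from typing import Callable, List, Tuple

Transform = Callable[[str, random.Random], str]


def insert_every(text: str, cadence: int, char: str) -> str:
    """Insert `char` after every `cadence` characters (not after the last chunk)."""
    chunks = [text[i:i + cadence] for i in range(0, len(text), cadence)]
    return char.join(chunks)


def compose(steps: List[Tuple[Transform, float]], seed=None) -> Callable[[str], str]:
    """Chain transforms; each one is applied with its own probability p."""
    rng = random.Random(seed)

    def apply(text: str) -> str:
        for transform, p in steps:
            # rng.random() is in [0, 1), so p=1.0 always applies and p=0.0 never does
            if rng.random() < p:
                text = transform(text, rng)
        return text

    return apply


def add_space(text: str, rng: random.Random) -> str:
    return insert_every(text, 3, " ")


def add_punct(text: str, rng: random.Random) -> str:
    return insert_every(text, 5, rng.choice(".,;:!?"))


augment = compose([(add_space, 0.5), (add_punct, 1.0)], seed=0)
print(augment("Ziplock Seal Transparent Storage Bag"))
```

Because the whitespace step is skipped for roughly half of the inputs, the resulting dataset mixes both kinds of noise rather than applying them uniformly.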

## 2.2 Test the pipeline v2 and publish

In [12]:
train_data_pipeline_v2.build()
train_data_pipeline_v2()
train_data_pipeline_v2.publish()

Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
Modifying stage 'text_augmentation' in 'dvc.yaml'


## 2.3 Let's check our pipeline `train_data_pipeline`

In [13]:
!python dataci/command/pipeline.py ls

train_data_pipeline
|  Version	Create time
|- bcca51d	2023-03-24 13:51:00
|    |- run1
|- ceb5abe	2023-03-24 13:51:15


## 2.4 Publish text classification dataset v2

It is easy to update the output dataset once our data pipeline has a new version:

In [14]:
train_data_pipeline_v2()

Running stage 'text_augmentation':
> python code/text_augmentation.py


INFO:dataci.pipeline.stage:Load input /root/workspace/DataCI/.dataci/tmp/text_raw_train/f3a821d8a22533c642f16f115d5d2b3f02569f0d/train.csv as pandas Dataframe
INFO:dataci.pipeline.stage:Save output to feat/text_aug.csv
INFO:dataci.run.save:Recover dvc file feat/text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat/text_aug.csv


Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.


ERROR: output 'feat/text_aug.csv' is already specified in stage: 'text_augmentation'.
INFO:dataci.dataset.publish:Adding dataset to db: train_data_pipeline:text_aug@c5a9a4941789584da67253aaf6eafc4ae5820977
INFO:dataci.pipeline.pipeline:text_raw_train@f3a821d8a22533c642f16f115d5d2b3f02569f0d >>> train_data_pipeline@ceb5abe.run1 >>> train_data_pipeline:text_aug@None


<dataci.run.run.Run at 0x7f2b544bc7f0>

# 3. Try with More Raw Data

Our human annotators have finished labelling the second batch of 10K records. We publish the two batches of
labelled raw data, combined, as v2:

In [15]:
%%bash
# Download text_raw_v2
cp -rf dataset/text_cls_v2 data/text_raw_v2/

Publish raw data v2:

In [16]:
!python dataci/command/dataset.py publish -n text_raw_train data/text_raw_v2/train.csv

INFO:dataci.dataset.publish:Caching dataset files: data/text_raw_v2/train.csv
INFO:dataci.dataset.publish:Adding dataset to db: text_raw_train@09d026c8a2f88dedff2f0c495b42975707e432e1


We can easily update our text classification dataset:

In [17]:
!python dataci/command/dataset.py update -n train_data_pipeline:text_aug

INFO:dataci.dataset.update:Searching changes...
INFO:dataci.dataset.update:Found 2 possible updates:
INFO:dataci.dataset.update:| S.N. | Parent dataset                 >>> Pipeline                       |
INFO:dataci.dataset.update:|    1 | text_raw_train@09d026c         >>> train_data_pipeline@bcca51d    |
INFO:dataci.dataset.update:|    2 | text_raw_train@09d026c         >>> train_data_pipeline@ceb5abe    |
INFO:dataci.dataset.update:Total 1 dataset version(s), 2 pipeline versions(s).
INFO:dataci.dataset.update:Executing dataset update...
Stage 'text_augmentation' is cached - skipping run, checking out outputs
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
INFO:dataci.run.save:Recover dvc file feat/text_aug.csv.dvc
INFO:dataci.dataset.publish:Caching dataset files: feat/text_aug.csv

# 4. Summary

That was a long journey! Wait, how many datasets do we have, and how do they perform?
Things get quite messy after we publish many datasets and pipelines and run a lot of workflows.
Luckily, while we develop our data pipelines, DataCI helps manage and audit all of them!

## 4.1 How many datasets are built?

In [18]:
!python dataci/command/dataset.py ls

text_raw_train
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- f3a821d	N.A.		None@None		17057	2023-03-24 13:50:57
|- 09d026c	N.A.		None@None		35813	2023-03-24 13:51:21
text_raw_val
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 641a430	N.A.		None@None		1896	2023-03-24 13:50:58
train_data_pipeline:text_aug
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- db07fae	train_data_pipeline@bcca51d		text_raw_train@f3a821d8a22533c642f16f115d5d2b3f02569f0d		17057	2023-03-24 13:51:15
|- c5a9a49	train_data_pipeline@ceb5abe		text_raw_train@f3a821d8a22533c642f16f115d5d2b3f02569f0d		17057	2023-03-24 13:51:20
|- b312fde	train_data_pipeline@bcca51d		text_raw_train@09d026c8a2f88dedff2f0c495b42975707e432e1		17057	2023-03-24 13:51:23
|- d3c0a2f	train_data_pipeline@ceb5abe		text_raw_train@09d026c8a2f88dedff2f0c495b42975707e432e1		17057	2023-03-24 13:51:24
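
The four versions of `train_data_pipeline:text_aug` listed above correspond to every combination of a raw dataset version and a pipeline version. The sketch below reproduces that enumeration with `itertools.product`; `enumerate_builds` is a hypothetical helper for illustration and says nothing about how DataCI searches for updates internally.

```python
from itertools import product

# Short version IDs as shown in the listing above
raw_versions = ["f3a821d", "09d026c"]
pipeline_versions = ["bcca51d", "ceb5abe"]


def enumerate_builds(parent_name, parent_versions, pipeline_name, pipeline_versions):
    """Pair every parent dataset version with every pipeline version."""
    return [
        (f"{parent_name}@{data_v}", f"{pipeline_name}@{pipe_v}")
        for data_v, pipe_v in product(parent_versions, pipeline_versions)
    ]


builds = enumerate_builds(
    "text_raw_train", raw_versions, "train_data_pipeline", pipeline_versions
)
for parent, pipeline in builds:
    print(f"{parent} >>> {pipeline}")
```

Restricting `raw_versions` to the newly published `09d026c` yields the two pairs that the `update` command reported as possible updates.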


## 4.2 Compare different dataset versions

In [19]:
# TODO

## 4.3 How many pipelines are built?

In [20]:
!python dataci/command/pipeline.py ls

train_data_pipeline
|  Version	Create time
|- bcca51d	2023-03-24 13:51:00
|    |- run1
|    |- run2
|- ceb5abe	2023-03-24 13:51:15
|    |- run1
|    |- run2
