In the previous tutorial, we showed how DataCI helps manage and build datasets from different raw datasets and
pipelines. In this tutorial, we will show how to use DataCI to benchmark those datasets.

Data is the most important part of the machine learning pipeline. Data scientists spend most of their time cleaning,
augmenting, and preprocessing data to get the best online performance out of the same model structure.
[In the previous tutorial](/example/create_text_classification_dataset), we built four versions of the text classification
dataset `train_data_pipeline:text_aug`. We will now determine which version performs best.

# 0. Prerequisites

In [1]:
%pip install scikit-learn
%pip install transformers
print(
    'You should also install pytorch, check https://pytorch.org/get-started/locally/ to find specific version '
    'matches your OS, package and platform'
)

Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.
You should also install pytorch, check https://pytorch.org/get-started/locally/ to find specific version matches your OS, package and platform


In [2]:
%cd ../../

import os

os.environ['PYTHONPATH'] = os.getcwd()

/root/workspace/DataCI


This tutorial uses the datasets built in the previous tutorial, so make sure you have run it first.

## Publish text classification dataset v1 - v4

We have published 4 versions of the text classification dataset in the previous tutorial.
We can list them with the following command:

In [3]:
!python dataci/command/dataset.py ls train_data_pipeline:text_aug

train_data_pipeline:text_aug
|  Version	Yield pipeline	Parent dataset	Size	Create time
|- 7095be7	train_data_pipeline@ca95b9f		text_raw_train@f3a821d		17057	2023-03-28 00:17:27
|- af24cc8	train_data_pipeline@ca95b9f		text_raw_train@09d026c		35813	2023-03-28 00:18:17
|- dd98486	train_data_pipeline@3f8daca		text_raw_train@09d026c		35813	2023-03-28 00:18:14
|- f4bb07a	train_data_pipeline@3f8daca		text_raw_train@f3a821d		17057	2023-03-28 00:17:58


# 1. Benchmark Text Classification Dataset

Recall that in the previous tutorial we benchmarked the text classification dataset v1 with
a training script. We can do the same more easily with DataCI's data-centric benchmark tool.

## 1.1 Benchmark text classification dataset v1

Get the text classification dataset v1 as the training dataset:

In [4]:
from dataci.dataset import list_dataset

# Get all versions of the text classification dataset
text_classification_datasets = list_dataset('train_data_pipeline:text_aug', tree_view=False)
# Sort by created date
text_classification_datasets.sort(key=lambda x: x.create_date)
train_dataset = text_classification_datasets[0]
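The selection above relies on sorting by creation time, so that index 0 is the oldest version (v1). The sketch below replays that sort on stand-in records. `DatasetStub` is a hypothetical class used only for illustration, not a DataCI type; the versions and timestamps are copied from the `ls` output above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DatasetStub:
    """Hypothetical stand-in for the objects returned by list_dataset."""
    version: str
    create_date: datetime


# Versions and creation times copied from the `ls` output above.
stubs = [
    DatasetStub('7095be7', datetime(2023, 3, 28, 0, 17, 27)),
    DatasetStub('af24cc8', datetime(2023, 3, 28, 0, 18, 17)),
    DatasetStub('dd98486', datetime(2023, 3, 28, 0, 18, 14)),
    DatasetStub('f4bb07a', datetime(2023, 3, 28, 0, 17, 58)),
]

# Same sort key as the cell above: oldest first, so index 0 is v1.
stubs.sort(key=lambda x: x.create_date)
ordered_versions = [d.version for d in stubs]
print(ordered_versions)
```

Under this ordering, v1 is `7095be7`, and the remaining versions follow as `f4bb07a`, `dd98486`, `af24cc8`, which is the order the run logs in section 1.2 show.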

Get the validation split of the raw text dataset v1 as the test dataset:

In [5]:
# Get all versions of the raw text dataset val split
text_raw_val_datasets = list_dataset('text_raw_val', tree_view=False)
# Sort by created date
text_raw_val_datasets.sort(key=lambda x: x.create_date)
test_dataset = text_raw_val_datasets[0]

Since the text classification dataset v1 is built with the data augmentation pipeline `train_data_pipeline`,
we will run a `data_augmentation` data-centric benchmark with the `text_classification` ML task.

We will use `bert-base-cased` as the model and, for demonstration purposes, train for only 3 epochs with 10 steps per epoch.
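Judging from the `args:` line in the run log further down, the training settings appear to reach the training script as `--key=value` flags. The sketch below reproduces that flag string from the same settings. `to_cli_flags` is a hypothetical helper written for this illustration, not part of DataCI, and the real forwarding logic may differ.

```python
def to_cli_flags(kwargs):
    """Hypothetical helper: render keyword arguments as --key=value flags."""
    return [f'--{key}={value}' for key, value in kwargs.items()]


# Same training settings as used for the benchmark in this section.
train_kwargs = dict(
    epochs=3,
    batch_size=4,
    learning_rate=1e-5,
    logging_steps=1,
    max_train_steps_per_epoch=10,
    max_val_steps_per_epoch=10,
    seed=42,
)

flags = to_cli_flags(train_kwargs)
print(' '.join(flags))
```

Note that Python renders `1e-5` as `1e-05`, which is the form that shows up in the log.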

In [6]:
from dataci.benchmark import Benchmark


benchmark = Benchmark(
    type='data_augmentation',
    ml_task='text_classification',
    model_name='bert-base-cased',
    train_dataset=train_dataset,
    test_dataset=test_dataset,
    train_kwargs=dict(
        epochs=3,
        batch_size=4,
        learning_rate=1e-5,
        logging_steps=1,
        max_train_steps_per_epoch=10,
        max_val_steps_per_epoch=10,
        seed=42,
    ),
)
# Run benchmark
benchmark.run()

# Check benchmark results
print(benchmark.metrics)

A       .dataci/tmp/train_data_pipeline%3Atext_aug/7095be77f2b58756305c60e36009d0f6b742edd3/text_aug.csv


INFO:dataci.benchmark.benchmark:Run data-centric benchmarking: type=data_augmentation, ml_task=text_classification
args: --train_dataset=/root/workspace/DataCI/.dataci/tmp/train_data_pipeline%3Atext_aug/7095be77f2b58756305c60e36009d0f6b742edd3/text_aug.csv --test_dataset=/root/workspace/DataCI/.dataci/tmp/text_raw_val/641a430201153db41a736961eea5299e53955fdc/val.csv --model_name=bert-base-cased --id_col=id --exp_root=/root/workspace/DataCI/.dataci/benchmark --epochs=3 --batch_size=4 --learning_rate=1e-05 --logging_steps=1 --max_train_steps_per_epoch=10 --max_val_steps_per_epoch=10 --seed=42


A       .dataci/tmp/text_raw_val/641a430201153db41a736961eea5299e53955fdc/val.csv


INFO:dataci.benchmark.text_classification:PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Alibaba Cloud Linux release 3 (Soaring Falcon)  (x86_64)
GCC version: (GCC) 10.2.1 20200825 (Alibaba 10.2.1-3 2.32)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.32

Python version: 3.8.16 (default, Mar  2 2023, 03:21:46)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.134-13.al8.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.4.100
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.82.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn.so.8.2.4
/usr/local/

           id                                       product_name category_lv0
0  2197727145  ["Ziplo-ck Se:al Tr'anspa!rent ...Stora...ge B...         FMCG
1  6946156365  ["Lab O,n Hai:r Ant;i Hai'r Fal!l Sha,mpoo ;30...         FMCG
2  3865988017  ["กระปุ,กออมส:ิน AT.M กระ;ปุกออ.มสิน ;มีดนต.รี...         FMCG
3  1585616576  ["Siêu ...Sim D.ata 4-G Trọ:n Gói! 1 Nă...m Kh...           EL
4   733610874  ['Quần ?Jean !Nữ Ốn:g Loe; Lưng. Cao ?Aaa J?ea...      Fashion
           id                                       product_name category_lv0
0  3373941853    Adorn by Calmskin Blueberry Whipped Scrub 250ml         FMCG
1  5948198349  Kacamata Hitam Korean Fashion Wanita/Pria Sung...      Fashion
2    11608450  de Nature - Kapsul  Ziirzax dan Typhogell - Ob...         FMCG
3  2107761836  Samsung Galaxy A73 5G | 8GB+128GB | 8GB+256GB ...           EL
4  3251354065  NEW LABEL SKEENCARE PEELING LOTION TRIO 100ML ...         FMCG


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.27.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vo

{'train': {0: {'loss': 1.4736366271972656, 'acc': 0.2708333333333333, 'batch_time': 0.03847346703211466, 'losses': [1.0801517963409424, 1.4044456481933594, 1.6743097305297852, 1.5192124843597412, 1.3318417072296143, 1.5457814931869507, 1.8810664415359497, 1.7348101139068604, 1.3951787948608398, 1.1554820537567139, 1.3452134132385254, 1.6161458492279053], 'accs': [0.75, 0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 0.25, 0.5, 0.5, 0.25], 'batch_times': [0.0776362419128418, 0.03356051445007324, 0.049781084060668945, 0.04055047035217285, 0.032425880432128906, 0.03253507614135742, 0.03146839141845703, 0.03141450881958008, 0.034598350524902344, 0.03251051902770996, 0.03153181076049805, 0.03366875648498535]}, 2: {'loss': 1.4075541694959004, 'acc': 0.25, 'batch_time': 0.03294022878011068, 'losses': [1.3172684907913208, 1.3923884630203247, 1.4657186269760132, 1.374738335609436, 1.516617774963379, 1.3539741039276123, 1.2304067611694336, 1.4734611511230469, 1.431309700012207, 1.3390390872955322, 1.4708
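The printed metrics are a nested dictionary: a split name, then an index, then summary values (`loss`, `acc`, `batch_time`) next to per-batch lists (`losses`, `accs`, `batch_times`). In the visible part of the output, the summary values look like plain means of the per-batch lists. The sketch below checks this for the first `train` entry using values copied from the output; it is a reading of the printed numbers, not a statement about how DataCI computes them.

```python
from statistics import mean

# Per-batch values copied from the first 'train' entry of the printed metrics.
losses = [
    1.0801517963409424, 1.4044456481933594, 1.6743097305297852,
    1.5192124843597412, 1.3318417072296143, 1.5457814931869507,
    1.8810664415359497, 1.7348101139068604, 1.3951787948608398,
    1.1554820537567139, 1.3452134132385254, 1.6161458492279053,
]
accs = [0.75, 0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 0.25, 0.5, 0.5, 0.25]

# Average the per-batch values and compare with the reported summary values.
epoch_loss = mean(losses)
epoch_acc = mean(accs)
print(epoch_loss, epoch_acc)
```

Both means agree with the reported `loss` (1.4736...) and `acc` (0.2708...) to within floating-point rounding.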

## 1.2 Benchmark all text classification datasets (v2 - v4)

In [7]:
for text_classification_dataset in text_classification_datasets[1:]:
    benchmark = Benchmark(
        type='data_augmentation',
        ml_task='text_classification',
        model_name='bert-base-cased',
        train_dataset=text_classification_dataset,
        test_dataset=test_dataset,
        train_kwargs=dict(
            epochs=3,
            batch_size=4,
            learning_rate=1e-5,
            logging_steps=1,
            max_train_steps_per_epoch=10,
            max_val_steps_per_epoch=10,
            seed=42,
        ),
    )
    # Run benchmark
    benchmark.run()

INFO:dataci.benchmark.benchmark:Run data-centric benchmarking: type=data_augmentation, ml_task=text_classification
args: --train_dataset=/root/workspace/DataCI/.dataci/tmp/train_data_pipeline%3Atext_aug/f4bb07ac9e3f247062d7e96988e54d7d234e62bf/text_aug.csv --test_dataset=/root/workspace/DataCI/.dataci/tmp/text_raw_val/641a430201153db41a736961eea5299e53955fdc/val.csv --model_name=bert-base-cased --id_col=id --exp_root=/root/workspace/DataCI/.dataci/benchmark --epochs=3 --batch_size=4 --learning_rate=1e-05 --logging_steps=1 --max_train_steps_per_epoch=10 --max_val_steps_per_epoch=10 --seed=42


A       .dataci/tmp/train_data_pipeline%3Atext_aug/f4bb07ac9e3f247062d7e96988e54d7d234e62bf/text_aug.csv


           id                                       product_name category_lv0
0  2197727145  ["Ziplo!ck Se?al Tr!anspa:rent ;Stora...ge Ba....         FMCG
1  6946156365  ["L\x0ca\x0cb-\x0c \x0cO\x0c,n\x0c \x0cH...\x0...         FMCG
2  3865988017  ["กระปุ!กออมส?ิน AT,M กระ'ปุกออ!มสิน ,มีดนต,รี...         FMCG
3  1585616576  ["S\ni\nê,\nu\n \n!S\ni\nm.\n \nD\n:a\nt\na.\n...           EL
4   733610874  ['Q\x0bu\x0bầ?\x0bn\x0b \x0b;J\x0be\x0ba.\x0bn...      Fashion
           id                                       product_name category_lv0
0  3373941853    Adorn by Calmskin Blueberry Whipped Scrub 250ml         FMCG
1  5948198349  Kacamata Hitam Korean Fashion Wanita/Pria Sung...      Fashion
2    11608450  de Nature - Kapsul  Ziirzax dan Typhogell - Ob...         FMCG
3  2107761836  Samsung Galaxy A73 5G | 8GB+128GB | 8GB+256GB ...           EL
4  3251354065  NEW LABEL SKEENCARE PEELING LOTION TRIO 100ML ...         FMCG


A       .dataci/tmp/train_data_pipeline%3Atext_aug/dd98486cc6f8d379827dc2fda4ec2ff1404e5c71/text_aug.csv


INFO:dataci.benchmark.benchmark:Run data-centric benchmarking: type=data_augmentation, ml_task=text_classification
args: --train_dataset=/root/workspace/DataCI/.dataci/tmp/train_data_pipeline%3Atext_aug/dd98486cc6f8d379827dc2fda4ec2ff1404e5c71/text_aug.csv --test_dataset=/root/workspace/DataCI/.dataci/tmp/text_raw_val/641a430201153db41a736961eea5299e53955fdc/val.csv --model_name=bert-base-cased --id_col=id --exp_root=/root/workspace/DataCI/.dataci/benchmark --epochs=3 --batch_size=4 --learning_rate=1e-05 --logging_steps=1 --max_train_steps_per_epoch=10 --max_val_steps_per_epoch=10 --seed=42

           id                                       product_name category_lv0
0  2197727145  ["Ziplo:ck Se'al Tr:anspa...rent !Stora'ge Ba!...         FMCG
1  6946156365  ["Lab O.n Hai.r Ant!i Hai-r Fal...l Sha...mpoo...         FMCG
2  3865988017  ["กระปุ...กออมส,ิน AT?M กระ:ปุกออ.มสิน !มีดนต'...         FMCG
3  1585616576  ["S\x0ci\x0cê...\x0cu\x0c \x0c...S\x0ci\x0cm-\...           EL
4   733610874  ["Q\ru\rầ?\rn\r \r-J\re\ra-\rn\r \r-N\rữ\r .\r...      Fashion
           id                                       product_name category_lv0
0  3373941853    Adorn by Calmskin Blueberry Whipped Scrub 250ml         FMCG
1  5948198349  Kacamata Hitam Korean Fashion Wanita/Pria Sung...      Fashion
2    11608450  de Nature - Kapsul  Ziirzax dan Typhogell - Ob...         FMCG
3  2107761836  Samsung Galaxy A73 5G | 8GB+128GB | 8GB+256GB ...           EL
4  3251354065  NEW LABEL SKEENCARE PEELING LOTION TRIO 100ML ...         FMCG


A       .dataci/tmp/train_data_pipeline%3Atext_aug/af24cc893f2cc12abe4f59cc658fbed982aacb08/text_aug.csv


INFO:dataci.benchmark.benchmark:Run data-centric benchmarking: type=data_augmentation, ml_task=text_classification
args: --train_dataset=/root/workspace/DataCI/.dataci/tmp/train_data_pipeline%3Atext_aug/af24cc893f2cc12abe4f59cc658fbed982aacb08/text_aug.csv --test_dataset=/root/workspace/DataCI/.dataci/tmp/text_raw_val/641a430201153db41a736961eea5299e53955fdc/val.csv --model_name=bert-base-cased --id_col=id --exp_root=/root/workspace/DataCI/.dataci/benchmark --epochs=3 --batch_size=4 --learning_rate=1e-05 --logging_steps=1 --max_train_steps_per_epoch=10 --max_val_steps_per_epoch=10 --seed=42

           id                                       product_name category_lv0
0  2197727145  ["Ziplo?ck Se:al Tr'anspa'rent 'Stora...ge Ba!...         FMCG
1  6946156365  ["Lab O!n Hai,r Ant?i Hai.r Fal?l Sha?mpoo :30...         FMCG
2  3865988017  ["กระปุ?กออมส:ิน AT;M กระ.ปุกออ;มสิน :มีดนต?รี...         FMCG
3  1585616576  ['Siêu ...Sim D,ata 4...G Trọ.n Gói, 1 Nă,m Kh...           EL
4   733610874  ["Quần 'Jean :Nữ Ốn!g Loe; Lưng. Cao -Aaa J.ea...      Fashion
           id                                       product_name category_lv0
0  3373941853    Adorn by Calmskin Blueberry Whipped Scrub 250ml         FMCG
1  5948198349  Kacamata Hitam Korean Fashion Wanita/Pria Sung...      Fashion
2    11608450  de Nature - Kapsul  Ziirzax dan Typhogell - Ob...         FMCG
3  2107761836  Samsung Galaxy A73 5G | 8GB+128GB | 8GB+256GB ...           EL
4  3251354065  NEW LABEL SKEENCARE PEELING LOTION TRIO 100ML ...         FMCG



# 2. Summary

## 2.1 What is the best dataset for text classification?

In [8]:
!python dataci/command/benchmark.py ls -me=val/loss,val/acc,test/acc,test/batch_time train_data_pipeline:text_aug

Train dataset: train_data_pipeline:text_aug, Test dataset: text_raw_val@641a430
Type: data_augmentation, ML task: text_classification, Model name: bert-base-cased
Dataset version   Val loss   Val acc    Test acc   Test batch_time
7095be7           1.359860   0.270833   0.333333   0.008315
f4bb07a           1.550337   0.145833   0.125000   0.008399
dd98486           1.471777   0.187500   0.208333   0.008417
af24cc8           1.578251   0.145833   0.208333   0.008138
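To read an answer off this table, the rows can be ranked by accuracy. The sketch below does that with the values copied from the output above. The tie-breaking rule (test accuracy first, then validation accuracy) and the reading of the scores as fractions of 48 samples are assumptions made for this illustration.

```python
# Rows copied from the `benchmark.py ls` output above:
# (dataset version, val loss, val acc, test acc, test batch_time)
rows = [
    ('7095be7', 1.359860, 0.270833, 0.333333, 0.008315),
    ('f4bb07a', 1.550337, 0.145833, 0.125000, 0.008399),
    ('dd98486', 1.471777, 0.187500, 0.208333, 0.008417),
    ('af24cc8', 1.578251, 0.145833, 0.208333, 0.008138),
]

# Rank by test accuracy, breaking ties with validation accuracy.
ranked = sorted(rows, key=lambda r: (r[3], r[2]), reverse=True)
best_version = ranked[0][0]

# Every accuracy is close to a multiple of 1/48, which suggests (an inference,
# not something the output states) that each score covers about 48 samples.
correct_counts = {r[0]: round(r[3] * 48) for r in rows}
print(best_version, correct_counts)
```

On these numbers, v1 (`7095be7`) ranks first on both validation and test accuracy. The model config lists four labels, so if the classes are roughly balanced, chance accuracy is about 0.25, and the gaps between versions amount to only a few samples. With 3 epochs of 10 steps each, this run is a demonstration of the workflow rather than a reliable comparison.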


## 2.2 What is the best data augmentation pipeline for text classification?

In [9]:
!python dataci/command/benchmark.py lsp -me=val/acc,test/acc train_data_pipeline

Type: data_augmentation, ML task: text_classification, Model name: bert-base-cased
Pipeline                       Dataset                                                                         
                               text_raw_train@f3a821d                   text_raw_train@09d026c                  
                               Val acc              Test acc             Val acc              Test acc            
train_data_pipeline@3f8daca    0.145833             0.125000             0.187500             0.208333             
train_data_pipeline@ca95b9f    0.270833             0.333333             0.145833             0.208333             
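The pipeline view is the per-dataset table from section 2.1 regrouped by the pipeline and parent dataset that produced each version. The sketch below rebuilds the test-accuracy part of that pivot from the dataset listing and the section 2.1 results. Averaging over the two raw dataset versions is one possible way to summarize a pipeline, added here as an assumption; it is not something the `lsp` command reports.

```python
from collections import defaultdict

# Lineage copied from the dataset listing: version -> (pipeline, parent dataset).
lineage = {
    '7095be7': ('train_data_pipeline@ca95b9f', 'text_raw_train@f3a821d'),
    'af24cc8': ('train_data_pipeline@ca95b9f', 'text_raw_train@09d026c'),
    'dd98486': ('train_data_pipeline@3f8daca', 'text_raw_train@09d026c'),
    'f4bb07a': ('train_data_pipeline@3f8daca', 'text_raw_train@f3a821d'),
}

# Test accuracy per dataset version, copied from section 2.1.
test_acc = {
    '7095be7': 0.333333,
    'f4bb07a': 0.125000,
    'dd98486': 0.208333,
    'af24cc8': 0.208333,
}

# Pivot: pipeline -> parent dataset -> test accuracy.
pivot = defaultdict(dict)
for version, (pipeline, parent) in lineage.items():
    pivot[pipeline][parent] = test_acc[version]

# Average each pipeline over the raw dataset versions it was applied to.
mean_test_acc = {p: sum(cols.values()) / len(cols) for p, cols in pivot.items()}
print(mean_test_acc)
```

With this summary, `train_data_pipeline@ca95b9f` averages about 0.27 test accuracy and `train_data_pipeline@3f8daca` about 0.17, though the short demo training run limits how much weight these numbers can bear.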
