# Mô hình mạng thần kinh chuyển đổi (CNN) trong Xử lý ngôn ngữ tự nhiên (NLP)

## 1.Introduction
- Mã hóa và xây dựng vốn từ vựng từ dữ liệu văn bản cho trước.
- Tải các vector (onehot vector) từ đã được đào tạo trước và tạo các lớp nhúng (embedding) để tinh chỉnh
- Xây dựng mô hình và đào tạo CNN với Pytorch

## 1.1 Các papers hỗ trợ
- [Mạng thần kinh chuyển đổi để phân loại câu](https://arxiv.org/abs/1408.5882) (Kim, 2014).
- [Phân tích độ nhạy của (và Hướng dẫn dành cho người thực hành) Mạng thần kinh xoắn để phân loại câu](https://arxiv.org/abs/1510.03820) (Zhang, 2015).

In [16]:
import numpy as np
import nltk
nltk.download("all")
import torch

%matplotlib inline

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /home/mccorixa/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/mccorixa/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/mccorixa/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/mccorixa/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/mccorixa/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading pa

### 1.3 Download Datasets
Tập dữ liệu Movie Review (MR), một tập dữ liệu về phân cực của câu từ trong bình phẩm về phim ảnh (Pang và Lee, 2005)
Bộ dữ liệu bao gồm:
- 5331 câu tích cực
- 5331 câu tiêu cực

In [17]:
URL = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
# Download Datasets
!wget -P -w 'Data/' $URL
# Unzip
!tar xvzf 'Data/rt-polaritydata.tar.gz' -C 'Data/'

--2023-02-13 07:25:08--  http://data/
Resolving data (data)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘data’
--2023-02-13 07:25:08--  https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 487770 (476K) [application/x-gzip]
Saving to: ‘-w/rt-polaritydata.tar.gz.11’


2023-02-13 07:25:13 (202 KB/s) - ‘-w/rt-polaritydata.tar.gz.11’ saved [487770/487770]

FINISHED --2023-02-13 07:25:13--
Total wall clock time: 4.2s
Downloaded: 1 files, 476K in 2.4s (202 KB/s)
rt-polaritydata.README.1.0.txt
rt-polaritydata/rt-polarity.neg
rt-polaritydata/rt-polarity.pos


In [18]:
def load_text(path):
    with open(path, 'rb') as file:
        lines = []
        for line in file:
            orig_rev = line.decode(errors='ignore').lower().strip()
            lines.append(orig_rev)

    return lines

# Load files
neg_text = load_text('Data/rt-polaritydata/rt-polarity.neg')
pos_text = load_text('Data/rt-polaritydata/rt-polarity.pos')

print("Positive texts:", len(pos_text))
print("Negative texts:", len(neg_text))

if len(pos_text) >=1:
    print("Example positive text:", pos_text[0])
if len(neg_text) >=1:
    print("Example negative text:", neg_text[0])

# Concatenate and label data
# 0: neg text, 1: pos text
texts = np.array(neg_text + pos_text)
labels = np.array([0]*len(neg_text) + [1]*len(pos_text))


Positive texts: 5331
Negative texts: 5331
Example positive text: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
Example negative text: simplistic , silly and tedious .


### 1.4 Download GoogleNews-vectors-negative300 Word Vectors
Tải Download GoogleNews-vectors-negative300 Word Vectors
Vector Embedding đã được huấn luyện sẵn của các từ vựng

In [19]:
#!pip install wget

#import wget
#url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
#filename = wget.download(url)

#f_in = gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb')
#f_out = open('GoogleNews-vectors-negative300.bin', 'wb')
#f_out.writelines(f_in)

### 1.5 GPU Training
Get GPU của máy để training (Chạy trên máy win và ubuntu) Mac chưa thử.

In [20]:
if torch.cuda.is_available():
    device = torch.device(0)
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: NVIDIA GeForce GTX 1660 Ti


## 2.Data Preparation
Chuẩn bị dữ liệu văn bản cho đào tạo.
[Tài liệu embedding word2vec](https://machinelearningcoban.com/tabml_book/ch_embedding/word2vec.html#equation-word2vec-softmax)
Lớp nhúng (embedding layer) như bảng thông số lấy các chỉ mục từ trong từ vựng làm vector từ đầu vào và đầu ra.
Lớp nhúng có hình dạng (N,d) trong đó N là kích thước của từ vựng trong một câu và d là kích thước nhúng.

### 2.1 Tokenize

Chia danh sách thành các chuỗi con, xây dựng vốn từ vựng và tinh chỉnh độ dài câu tối đa.
Nếu câu không có độ dài = maxlen gán thêm
Hàm mã hóa sẽ mã hóa các câu thành các con số là thứ tự của từ vựng đấy (Không theo quy luật nhưng không có ảnh hưởng đến kết quả)

In [21]:
from nltk.tokenize import word_tokenize

def tokenize(texts):
    # max_len
    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add <pad> and <unk> tokens to the vocabulary
    # https://torchtext.readthedocs.io/en/latest/vocab.html
    # Meaning <pad> and <unk> https://github.com/nicolas-ivanov/tf_seq2seq_chatbot/issues/15
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)
        tokenized_texts.append(tokenized_sent)
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

def encode(tokenized_texts, word2idx, max_len):

    input_ids = []
    for tokenized_sent in tokenized_texts:
        tokenized_sent += ['<pad>'] * (max_len - len(tokenized_sent))

        input_id = [word2idx.get(token) for token in tokenized_sent]
        input_ids.append(input_id)

    return np.array(input_ids)

### 2.2. Load Pretrained Vectors
Các vector đào tạo trước của các vocab, những từ không có hoặc thiếu chưa hoàn thiện sẽ được khởi tạo ngẫu nhiên cùng độ dài và phương sai

In [22]:
from tqdm import tqdm_notebook

def load_pretrained_vectors(word2idx, filename):
    print("Loading pretrained vectors...")
    count = 0

    with open(filename, "rb") as file:
        header = file.readline()
        vocab_size, layer1_size = map(int, header.split())

        embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), layer1_size))
        embeddings[word2idx['<pad>']] = np.zeros((layer1_size,))

        binary_len = np.dtype('float32').itemsize * layer1_size

        for line in tqdm_notebook(range(vocab_size)):
            word = []
            while True:
                bch = file.readline(1)
                if bch == b' ':
                    word = b''.join(word)
                    break
                if bch != '\n':
                    word.append(bch)
            try:
                err = word
                word = str(word.decode('UTF-8').strip())
            except UnicodeDecodeError:
                print("Khong decode dc")
                print(err)
            if word in word2idx:
                count += 1
                embeddings[word2idx[word]] = np.fromstring(file.read(binary_len), dtype='float32')
            else:
                file.read(binary_len)
    print(f"There are {count} / {len(word2idx)} pretrained vectors found.")
    return embeddings

In [23]:
# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts, word2idx, max_len = tokenize(texts)
input_ids = encode(tokenized_texts, word2idx, max_len)
print(input_ids)
# Load pretrained vectors
w2v_file = "GoogleNews-vectors-negative300.bin"
embeddings = load_pretrained_vectors(word2idx, w2v_file)
embeddings = torch.tensor(embeddings)
print(embeddings.shape)

Tokenizing...

[[    2     3     4 ...     0     0     0]
 [    8     9    10 ...     0     0     0]
 [   20     5    21 ...     0     0     0]
 ...
 [ 7335    60    24 ...     0     0     0]
 [    8     9  4739 ...     0     0     0]
 [  608    33 20279 ...     0     0     0]]
Loading pretrained vectors...


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for line in tqdm_notebook(range(vocab_size)):


  0%|          | 0/3000000 [00:00<?, ?it/s]

  embeddings[word2idx[word]] = np.fromstring(file.read(binary_len), dtype='float32')


There are 15952 / 20280 pretrained vectors found.
torch.Size([20280, 300])


### 2.3. Create PyTorch DataLoader

In [24]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)

def data_loader(train_inputs, val_inputs, train_labels, val_labels,
                batch_size=50):

    train_inputs, val_inputs, train_labels, val_labels =\
     tuple(torch.tensor(data) for data in [train_inputs, val_inputs, train_labels, val_labels])

    batch_size = 50

    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

90% of the dataset for training and 10% for validation.

In [25]:
from sklearn.model_selection import train_test_split

print(input_ids.shape)
print(labels.shape)
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, labels, test_size=0.1, random_state=42)
print(train_inputs.shape)
print(val_inputs.shape)
train_dataloader, val_dataloader = data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)


(10662, 62)
(10662,)
(9595, 62)
(1067, 62)


## 3. Model

CNN Architecture

The illustration of the CNN architecture that we are going to build with three filter sizes: 6, 8, 10 each of which has 100 filters.


In [26]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_NLP(nn.Module):
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[6, 8, 10],
                 num_filters=[100, 100, 100],
                 num_classes=2,
                 dropout=0.5):
        Ci = 1
        super(CNN_NLP, self).__init__()
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                          freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels= Ci,
                      out_channels=num_filters[i],
                      kernel_size=(filter_sizes[i], embed_dim))
            for i in range(len(filter_sizes))
        ])

        self.fc = nn.Linear(np.sum(num_filters), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):

        x = self.embedding(input_ids).float()

        x = x.unsqueeze(1)

        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]  # [(N, Co, W), ...]*len(Ks)

        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)

        x = torch.cat(x, 1)

        x = self.dropout(x)  # (N, len(Ks)*Co)
        logits = self.fc(x)  # (N, C)


        # x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
        #    for x_conv in x_conv_list]

        #x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
        #                 dim=1)

        #logits = self.fc(self.dropout(x_fc))

        return logits

In [27]:
import torch.optim as optim

def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[2, 3, 4],
                    num_filters=[100, 100, 100],
                    num_classes=2,
                    dropout=0.5,
                    learning_rate=0.01):
    """Instantiate a CNN model and an optimizer."""

    assert (len(filter_sizes) == len(num_filters)), "filter_sizes and \
    num_filters need to be of the same length."

    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=2,
                        dropout=0.5)

    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(cnn_model.parameters(),
                               lr=learning_rate,
                               rho=0.95)

    return cnn_model, optimizer

In [28]:
import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, val_dataloader=None, epochs=10):
    """Train the CNN model."""
    print("Summary's model: \n")
    print(model)
    # Tracking best validation accuracy
    best_accuracy = 0
    best_accuracy_train = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Train Acc':^12} |  {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*80)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)
            model.zero_grad()

            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            loss.backward()

            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)
        preds = torch.argmax(logits, dim=1).flatten()
        train_accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        if train_accuracy > best_accuracy_train:
                best_accuracy_train = train_accuracy
        # =======================================
        #               Evaluation
        # =======================================



        if val_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {train_accuracy:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")

    print("\n")
    print(f"Training complete! Best accuracy test: {best_accuracy:.2f}%.")
    print(f"Training complete! Best accuracy train: {best_accuracy_train:.2f}%.")

def evaluate(model, val_dataloader):

    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

In [29]:
# CNN-rand: Word vectors are randomly initialized.
set_seed(42)
cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                      embed_dim=300,
                                      learning_rate=0.25,
                                      dropout=0.5)
train(cnn_rand, optimizer, train_dataloader, val_dataloader, epochs=40)

Summary's model: 

CNN_NLP(
  (embedding): Embedding(20280, 300, padding_idx=0, max_norm=5.0)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Start training...

 Epoch  |  Train Loss  |  Train Acc   |   Val Loss  |  Val Acc  |  Elapsed 
--------------------------------------------------------------------------------
   1    |   0.685044   |  60.000000   |  0.656397  |   61.13   |   1.74   
   2    |   0.633154   |  64.444444   |  0.624810  |   65.30   |   1.73   
   3    |   0.582111   |  73.333333   |  0.597517  |   67.66   |   1.76   
   4    |   0.526749   |  71.111111   |  0.572969  |   69.93   |   1.75   
   5    |   0.464492   |  73.333333   |  0.558144  |   70.47   |   1.76   
   6    |   0.397413   |  84.444444   |  

In [30]:
set_seed(42)
cnn_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                        freeze_embedding=True,
                                        learning_rate=0.25,
                                        dropout=0.5)
train(cnn_static, optimizer, train_dataloader, val_dataloader, epochs=40)

Summary's model: 

CNN_NLP(
  (embedding): Embedding(20280, 300)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Start training...

 Epoch  |  Train Loss  |  Train Acc   |   Val Loss  |  Val Acc  |  Elapsed 
--------------------------------------------------------------------------------
   1    |   0.637786   |  80.000000   |  0.553841  |   73.48   |   0.96   
   2    |   0.508736   |  80.000000   |  0.468179  |   77.49   |   0.95   
   3    |   0.453095   |  80.000000   |  0.447818  |   80.02   |   0.95   
   4    |   0.414396   |  82.222222   |  0.443768  |   78.66   |   0.95   
   5    |   0.381587   |  68.888889   |  0.424471  |   81.29   |   0.95   
   6    |   0.348719   |  84.444444   |  0.420296  |   81.39   |   0.9

In [31]:
from torchsummary import summary
set_seed(42)
cnn_non_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
print(cnn_non_static)
train(cnn_non_static, optimizer, train_dataloader, val_dataloader, epochs=40)

CNN_NLP(
  (embedding): Embedding(20280, 300)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Summary's model: 

CNN_NLP(
  (embedding): Embedding(20280, 300)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Start training...

 Epoch  |  Train Loss  |  Train Acc   |   Val Loss  |  Val Acc  |  Elapsed 
--------------------------------------------------------------------------------
   1    |   0.637227   |  77.777778   |  0.551855  |   73.21   |   2.42   
   2    | 

## 5. Test Model

Let's test our CNN-non-static model on some examples.


In [None]:
def predict(text, model=cnn_non_static.to("cpu"), max_len=62):
    """Predict probability that a review is positive."""

    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    logits = model.forward(input_id)

    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    print(f"This review is {probs[1] * 100:.2f}% positive.")

In [None]:
predict("I really enjoyed it.")
predict("I have waited so long for this movie. I am now so satisfied and happy.")
predict("This movie is long and boring.")
predict("I don't like the ending.")

In [None]:
from torchtext.datasets import SST2

batch_size = 16

train_datapipe = SST2(split="train")
dev_datapipe = SST2(split="dev")


# Transform the raw dataset using non-batched API (i.e apply transformation line by line)
def apply_transform(x):
    return text_transform(x[0]), x[1]


train_datapipe = train_datapipe.map(apply_transform)

train_datapipe = train_datapipe.batch(batch_size)
train_datapipe = train_datapipe.rows2columnar(["token_ids", "target"])
train_dataloader = DataLoader(train_datapipe, batch_size=None)


dev_datapipe = dev_datapipe.map(apply_transform)
dev_datapipe = dev_datapipe.batch(batch_size)
dev_datapipe = dev_datapipe.rows2columnar(["token_ids", "target"])
dev_dataloader = DataLoader(dev_datapipe, batch_size=None)


In [None]:
def batch_transform(x):
    return {"token_ids": text_transform(x["text"]), "target": x["label"]}


train_datapipe = train_datapipe.batch(batch_size).rows2columnar(["text", "label"])
train_datapipe = train_datapipe.map(lambda x: batch_transform)
dev_datapipe = dev_datapipe.batch(batch_size).rows2columnar(["text", "label"])
dev_datapipe = dev_datapipe.map(lambda x: batch_transform)
print(train_datapipe)
print(dev_datapipe)