# Use ERNIE for text classification

created by [JJC@RUC](mailto:jincheng_jiang@foxmail.com)

This notebook follows [the official tutorial](https://github.com/PaddlePaddle/PaddleHub/blob/release/v2.0.0-beta/demo/text_classification/README.md) and the [customized dataset tutorial](https://paddlehub.readthedocs.io/zh-cn/release-v2.1/finetune/customized_dataset.html#id11), and tries to reproduce, to some extent, the work of [Jin et al. 2024](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CJFD&dbname=CJFDAUTO&filename=jjyj202403005).

To avoid the trouble of setting up an environment and finding a GPU, you can run this on a remote server such as `matgo.cn` (not promoting!).


In [None]:
## make sure you have paddle and paddlehub installed
## if anything goes wrong, you can run `pip install -U paddlepaddle` from your shell
import paddlehub as hub
# import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import csv
# import os
# import logging


In [7]:
## prepare the `ernie_tiny` model and inspect its properties
## the first time you load the model, it is automatically downloaded to `/root/.paddlehub/modules/ernie_tiny` on Unix-like systems
## you can choose a different download directory, as long as you remember it for later reference
testmodel = hub.Module(name='ernie_tiny', version='2.0.1', task='seq-cls', num_classes=2) # load model
print(testmodel.directory) # print directory of the model
testmodel # show structure of the model


[2024-06-01 12:14:24,238] [    INFO] - Loading weights file from cache at /root/.paddlenlp/models/ernie-tiny/model_state.pdparams
[2024-06-01 12:14:24,714] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-06-01 12:14:31,469] [    INFO] - All model checkpoint weights were used when initializing ErnieForSequenceClassification.
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


/root/.paddlehub/modules/ernie_tiny


ErnieTiny(
  (model): ErnieForSequenceClassification(
    (ernie): ErnieModel(
      (embeddings): ErnieEmbeddings(
        (word_embeddings): Embedding(50006, 1024, padding_idx=0, sparse=False)
        (position_embeddings): Embedding(600, 1024, sparse=False)
        (token_type_embeddings): Embedding(2, 1024, sparse=False)
        (layer_norm): LayerNorm(normalized_shape=[1024], epsilon=1e-12)
        (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
      )
      (encoder): TransformerEncoder(
        (layers): LayerList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiHeadAttention(
              (q_proj): Linear(in_features=1024, out_features=1024, dtype=float32)
              (k_proj): Linear(in_features=1024, out_features=1024, dtype=float32)
              (v_proj): Linear(in_features=1024, out_features=1024, dtype=float32)
              (out_proj): Linear(in_features=1024, out_features=1024, dtype=float32)
            )
            (linear1): Lin

In [None]:
## load and clean the data
## my data has two labels and about 180000 observations
## split ratio train:dev:test = 8:1:1
rawdata = pd.read_csv("/mnt/data/data.csv")
rawdata = rawdata[['label','text']]
rawdata.head()


In [9]:
# my labels are balanced, so a plain `sample` is enough for the split
train = rawdata.sample(frac=0.8, random_state=0)
others = rawdata.drop(train.index)
test = others.sample(frac=0.5, random_state=0)
dev = others.drop(test.index)
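
The cell above relies on the labels being balanced. When they are not, sampling within each label keeps the class ratio the same in all three parts. A minimal sketch, assuming pandas 1.1 or later (for `DataFrameGroupBy.sample`); the helper name `stratified_split` is hypothetical and the toy frame only stands in for `rawdata`:

```python
import pandas as pd


def stratified_split(df, label_col="label", seed=0):
    """Split df into train/dev/test at roughly 8:1:1, sampling within each label."""
    # Take 80% of the rows of every label for training.
    train = df.groupby(label_col).sample(frac=0.8, random_state=seed)
    others = df.drop(train.index)
    # Split the remaining 20% in half, again label by label.
    test = others.groupby(label_col).sample(frac=0.5, random_state=seed)
    dev = others.drop(test.index)
    return train, dev, test


# Toy, imbalanced stand-in for `rawdata`: 90 'data' rows and 10 'nondata' rows.
toy = pd.DataFrame({
    "label": ["data"] * 90 + ["nondata"] * 10,
    "text": [f"sample text {i}" for i in range(100)],
})
train, dev, test = stratified_split(toy)
print(len(train), len(dev), len(test))
print(train["label"].value_counts().to_dict())
```

As in the original cell, the split assumes the index of the frame is unique, which is the case for a frame read directly with `pd.read_csv`.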


In [11]:
## save the three splits as tab-delimited files with a header row
path = r'/mnt/ERINE_proj/data'
train.to_csv(path+'/train.csv', sep='\t', index=False, header=True)
dev.to_csv(path+'/dev.csv', sep='\t', index=False, header=True)
test.to_csv(path+'/test.csv', sep='\t', index=False, header=True)


In [None]:
## prepare the dataset
## as the official tutorial says, the data can be saved in a txt or csv file
## with the label in the first column and the text in the second, delimited by Tab (that is, '\t')
## it is recommended that you save your data in the folder 'your project name'/data
from paddlehub.datasets.base_nlp_dataset import TextClassificationDataset

class MyDataset(TextClassificationDataset):
    # directory that holds train.csv, dev.csv and test.csv
    base_path = '/mnt/ERINE_proj/data'
    # all labels that appear in the label column
    label_list=['data', 'nondata']

    def __init__(self, tokenizer, max_seq_len: int = 128, mode: str = 'train'):
        if mode == 'train':
            data_file = 'train.csv'
        elif mode == 'test':
            data_file = 'test.csv'
        else:
            data_file = 'dev.csv'
        super().__init__(
            base_path=self.base_path,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            data_file=data_file,
            label_list=self.label_list,
            is_file_with_header=True)


tokenizer = testmodel.get_tokenizer()
train_dataset = MyDataset(tokenizer)
dev_dataset = MyDataset(tokenizer=tokenizer, mode='dev')
test_dataset = MyDataset(tokenizer=tokenizer, mode='test')
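
The file format described in the comments above (header row, label first, text second, tab-delimited) can be checked before the dataset is built. A minimal sketch using only the standard library; `check_tsv` is a hypothetical helper, not part of PaddleHub, and the temporary file only stands in for `train.csv`. It assumes the label column must contain the same strings as `label_list`:

```python
import csv
import os
import tempfile
from collections import Counter


def check_tsv(path, label_list):
    """Count labels in a tab-delimited file with a header row; raise on malformed rows."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        if header != ["label", "text"]:
            raise ValueError(f"unexpected header: {header}")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 2:
                raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
            if row[0] not in label_list:
                raise ValueError(f"line {line_no}: unknown label {row[0]!r}")
            counts[row[0]] += 1
    return counts


# Write a tiny stand-in file and validate it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows([
            ["label", "text"],
            ["data", "first text"],
            ["nondata", "second text"],
        ])
    label_counts = check_tsv(demo_path, ["data", "nondata"])
print(dict(label_counts))
```

The returned counts also give a quick view of the label balance in each split. Texts that themselves contain tabs or line breaks are worth cleaning beforehand, since they may shift columns for readers that do not handle quoting.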


In [None]:
## train the model (fine-tuning)
## my dataset is rather large, so training is time-consuming (ETA 194 mins),
## and I interrupted it from the keyboard
## surprisingly, it took only 30/4568 batches in the first epoch
## for acc to increase from 0.68 to 0.83
import paddle
optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=testmodel.parameters()) # use Adam optimizer
trainer = hub.Trainer(testmodel, optimizer, checkpoint_dir='/mnt/ERINE_proj/test_ernie_text_cls') # setup trainer
trainer.train(train_dataset, epochs=10, batch_size=32, eval_dataset=dev_dataset, num_workers=8) # start training
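
The 4568 batches per epoch mentioned in the comments follow from the training-set size and `batch_size`. A small sketch of that arithmetic; the row count below is an assumed round figure (80% of roughly 180000), not the exact size of the dataset used above, so the result only approximates 4568. It also assumes the last, partial batch is kept:

```python
import math


def batches_per_epoch(n_rows, batch_size):
    """Number of batches needed to see every row once (the last batch may be partial)."""
    return math.ceil(n_rows / batch_size)


n_train = int(180_000 * 0.8)   # assumed: about 180000 observations, 80% used for training
steps = batches_per_epoch(n_train, batch_size=32)
print(steps)                   # batches in one epoch
print(steps * 10)              # batches over the 10 epochs requested above
```

Under the same assumptions, halving the training set or doubling `batch_size` halves the number of batches per epoch.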


# Gap to Jin et al. 2024

- maybe we need a smaller training set, a larger batch size, or a better GPU to speed up training
- 8-label classification
- smaller training units (sentences instead of paragraphs)
- comparison with other models, using more metrics (recall, precision, F1, F.8), using `sklearn.metrics`
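
The metrics in the last item can all be derived from the confusion counts. A dependency-free sketch of precision, recall and F-beta for one positive label; `sklearn.metrics` offers `precision_recall_fscore_support` and `fbeta_score` for the same purpose. The toy labels are made up, and reading `F.8` as F-beta with beta = 0.8 is an assumption:

```python
def precision_recall_fbeta(y_true, y_pred, positive, beta=1.0):
    """Precision, recall and F-beta for the label `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Made-up gold labels and predictions, for illustration only.
y_true = ["data", "data", "data", "nondata", "nondata", "nondata"]
y_pred = ["data", "data", "nondata", "nondata", "nondata", "nondata"]

precision, recall, f1 = precision_recall_fbeta(y_true, y_pred, positive="data")
_, _, f08 = precision_recall_fbeta(y_true, y_pred, positive="data", beta=0.8)
print(precision, recall, f1, f08)
```

A beta below 1 weights precision more heavily than recall, so here, where precision is higher than recall, the beta = 0.8 score comes out above F1.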