# Kashgari Classification Benchmarks

- Kashgari: 2.0.0
- TensorFlow: 2.0.0

## Data and Language Models

### Corpus

We use the [TNEWS dataset](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip)
from the [Chinese Language Understanding Evaluation benchmark (CLUE)](https://github.com/CLUEbenchmark/CLUE).
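
For orientation, each line of the TNEWS files is a standalone JSON object, and the preprocessing code in this benchmark reads its `sentence` and `label_desc` fields. The sketch below parses one such line. The sample record is made up for illustration and is not taken from the corpus.

```python
import json

# Hypothetical record in the JSON-lines layout (illustrative only, not real data).
sample_line = '{"label_desc": "news_entertainment", "sentence": "an example headline"}'

record = json.loads(sample_line)
text = record['sentence']
# Drop the shared 'news_' prefix so the label reads as a plain category name.
label = record['label_desc'].replace('news_', '')

print(text, '->', label)
```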

### Language models

Download the embeddings into the `embeddings` folder and unzip them:

- [BERT-Base, Chinese](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)

The final folder structure is:

```
.
└── embeddings
    └── chinese_L-12_H-768_A-12
```
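
Before starting a long benchmark run, it can help to confirm that the checkpoint was unzipped into the right place. The following is a minimal sketch, assuming the standard file names of the Google BERT release (`bert_config.json`, `vocab.txt`, `bert_model.ckpt.index`). The helper `missing_bert_files` is hypothetical and not part of Kashgari.

```python
import os
from typing import List

# File names assumed from the standard Google BERT release layout.
EXPECTED_FILES = ['bert_config.json', 'vocab.txt', 'bert_model.ckpt.index']


def missing_bert_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from `model_dir`."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == '__main__':
    model_dir = os.path.join('embeddings', 'chinese_L-12_H-768_A-12')
    missing = missing_bert_files(model_dir)
    print('missing files:', missing if missing else 'none')
```

An empty result means all three files were found. Anything else usually points to a wrong `EMBEDDING_FOLDER` or an archive unzipped one level too deep.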

```python
# Benchmark settings
EMBEDDING_FOLDER = '/Users/brikerman/Desktop/kashgari-demo/embeddings'
# Epochs without improvement before training stops / the learning rate is reduced
EARL_STOPPING_PATIENCE = 5
REDUCE_RL_PATIENCE = 5

EPOCHS = 30
```

```python
import os
from tensorflow.keras.utils import get_file
from kashgari.macros import DATA_PATH

# Download and extract the data to `~/.kashgari/tnews_public`
get_file('tnews_public.zip',
         'https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip',
         cache_subdir='tnews_public',
         cache_dir=DATA_PATH,
         extract=True)

corpus_path = os.path.join(DATA_PATH, 'tnews_public')
```

Downloading data from https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip


## Preprocess Dataset

We split `train.json` into training and validation sets at an 8:2 ratio and use `dev.json` as the test set.
`test.json` is unlabeled, so it cannot be used for evaluation.

```python
import json
from typing import List, Tuple
from kashgari.tokenizers import BertTokenizer
from sklearn.model_selection import train_test_split

tokenizer = BertTokenizer()

def parse_data_file(file_path: str) -> Tuple[List[List[str]], List[str]]:
    """Read a JSON-lines file and return tokenized sentences with their labels."""
    x_set: List[List[str]] = []
    y_set: List[str] = []
    # The corpus is Chinese text, so read it explicitly as UTF-8
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            sample = json.loads(line)
            x = tokenizer.tokenize(sample['sentence'])
            # Strip the shared 'news_' prefix from the label name
            y = sample['label_desc'].replace('news_', '')
            x_set.append(x)
            y_set.append(y)
    return x_set, y_set


train_json_x, train_json_y = parse_data_file(os.path.join(corpus_path, 'train.json'))
test_x, test_y = parse_data_file(os.path.join(corpus_path, 'dev.json'))

# 8:2 split of train.json into training and validation sets
train_x, valid_x, train_y, valid_y = train_test_split(train_json_x, train_json_y,
                                                      test_size=0.2,
                                                      random_state=42)

print(f'Train samples : {len(train_x)}')
print(f'Valid samples : {len(valid_x)}')
print(f'Test  samples : {len(test_x)}')
```

Train samples : 42688
Valid samples : 10672
Test  samples : 10000
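
The training and validation counts above follow directly from `test_size=0.2`: `train_test_split` rounds the held-out share up and assigns the remainder to training. The short check below reproduces the arithmetic. The total of 53,360 is simply the sum of the two printed counts.

```python
import math

total = 42688 + 10672                 # samples parsed from train.json
valid_count = math.ceil(0.2 * total)  # held-out share, rounded up
train_count = total - valid_count     # everything else is used for training

print(f'total={total} train={train_count} valid={valid_count}')
```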


```python
from kashgari.tasks.classification import BiGRU_Model, BiLSTM_Model
from kashgari.tasks.classification import CNN_Model, CNN_Attention_Model
from kashgari.tasks.classification import CNN_GRU_Model, CNN_LSTM_Model
from kashgari.embeddings import BertEmbedding
```

```python
# Google BERT (Chinese) embedding
bert_chinese = BertEmbedding(os.path.join(EMBEDDING_FOLDER, 'chinese_L-12_H-768_A-12'))

# `None` means no pre-trained embedding; those runs are logged as 'Bare'
embeddings = [
    None,
    bert_chinese,
]

model_classes_list = [
    BiGRU_Model,
    BiLSTM_Model,
    CNN_Model,
    CNN_Attention_Model,
    CNN_GRU_Model,
    CNN_LSTM_Model
]
```

```python
import glob
import time
import tensorflow as tf
from tensorflow import keras
from kashgari.callbacks import EvalCallBack
from benchmark_utils import BenchMarkHelper

# Number the log folder after the runs that already exist
run_count = glob.glob('./tf_dir/classification/run_*')
TF_LOG_FOLDER = f'./tf_dir/classification/run_{len(run_count)}'
TRAINING_LOG = f'./training_logs_{len(run_count)}.json'


for embed in embeddings:
    for MODEL_CLASS in model_classes_list:
        model_name = MODEL_CLASS.__name__
        if embed:
            embed_name = embed.__class__.__name__
        else:
            embed_name = 'Bare'
        run_name = f"{embed_name}-{model_name}"

        start_at = time.time()

        model = MODEL_CLASS(embed)
        # Run a single epoch first, before the evaluation callback is created
        model.fit(train_x, train_y, epochs=1)

        early_stop = keras.callbacks.EarlyStopping(patience=EARL_STOPPING_PATIENCE)
        reduce_lr_callback = keras.callbacks.ReduceLROnPlateau(factor=0.1,
                                                               patience=REDUCE_RL_PATIENCE)

        # Evaluate on the test set after every epoch
        eval_callback = EvalCallBack(model,
                                     test_x,
                                     test_y,
                                     step=1)

        tf_board = keras.callbacks.TensorBoard(
            log_dir=os.path.join(TF_LOG_FOLDER, run_name),
        )

        file_writer = tf.summary.create_file_writer(os.path.join(TF_LOG_FOLDER, run_name))
        file_writer.set_as_default()

        callbacks = [early_stop, reduce_lr_callback, eval_callback, tf_board]

        model.fit(train_x,
                  train_y,
                  valid_x,
                  valid_y,
                  callbacks=callbacks,
                  epochs=EPOCHS)

        # Persist per-epoch evaluation logs and the total training time
        BenchMarkHelper.save_training_logs(TRAINING_LOG,
                                           embedding_name=embed_name,
                                           model_name=model_name,
                                           logs=eval_callback.logs,
                                           training_duration=time.time()-start_at)
```
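
The nested loop above trains one model per (embedding, model class) pair, so two embeddings and six model classes give twelve runs. The sketch below shows how the run names are derived. It uses empty stand-in classes instead of the real Kashgari ones so that it runs without Kashgari installed, and it lists only two of the six model classes.

```python
from itertools import product


# Stand-in classes (hypothetical) that only carry the names used in the benchmark.
class BertEmbedding:
    pass


class BiGRU_Model:
    pass


class CNN_Model:
    pass


embeddings = [None, BertEmbedding()]
model_classes = [BiGRU_Model, CNN_Model]

run_names = []
# product() iterates in the same order as the nested for-loops
for embed, model_class in product(embeddings, model_classes):
    embed_name = embed.__class__.__name__ if embed else 'Bare'
    run_names.append(f'{embed_name}-{model_class.__name__}')

print(run_names)
```

Each name doubles as the TensorBoard sub-folder for that run, which keeps the curves of different combinations separate.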

Preparing text vocab dict: 100%|██████████| 42688/42688 [00:00<00:00, 202363.82it/s]
Preparing classification label vocab dict: 100%|██████████| 42688/42688 [00:00<00:00, 1088699.61it/s]
Calculating sequence length: 100%|██████████| 42688/42688 [00:00<00:00, 1152884.68it/s]


Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, None)]            0         
_________________________________________________________________
layer_embedding (Embedding)  (None, None, 100)         436200    
_________________________________________________________________
bidirectional (Bidirectional (None, 256)               176640    
_________________________________________________________________
dense (Dense)                (None, 15)                3855      
_________________________________________________________________
activation (Activation)      (None, 15)                0         
Total params: 616,695
Trainable params: 616,695
Non-trainable params: 0
_________________________________________________________________
Train for 667 steps
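
The parameter counts and the step count in this log can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below assumes a vocabulary of 4,362 tokens, 128 GRU units per direction with two bias vectors per weight set (the TensorFlow 2 default layout), and a batch size of 64. These values are inferred from the printed numbers rather than stated in the source.

```python
import math

vocab_size, embed_dim = 4362, 100   # inferred: 4362 * 100 = 436200
gru_units, num_classes = 128, 15    # inferred from the (None, 256) and (None, 15) shapes
batch_size = 64                     # assumed
train_samples = 42688

embedding_params = vocab_size * embed_dim
# A GRU holds three weight sets (update gate, reset gate, candidate state);
# each has an input kernel, a recurrent kernel and two bias vectors.
gru_params = 3 * (gru_units * (embed_dim + gru_units) + 2 * gru_units)
bigru_params = 2 * gru_params       # forward + backward direction
dense_params = 2 * gru_units * num_classes + num_classes
total_params = embedding_params + bigru_params + dense_params

steps_per_epoch = math.ceil(train_samples / batch_size)

print(embedding_params, bigru_params, dense_params, total_params, steps_per_epoch)
```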

               precision    recall  f1-score   support

  agriculture     0.4462    0.4696    0.4576       494
          car     0.6245    0.5992    0.6116       791
      culture     0.3319    0.6101    0.4299       736
          edu     0.5377    0.5186    0.5280       646
entertainment     0.5155    0.4758    0.4949       910
      finance     0.4451    0.5042    0.4728       956
         game     0.6018    0.4977    0.5449       659
        house     0.6124    0.4974    0.5489       378
     military     0.4932    0.4525    0.4720       716
       sports     0.6491    0.6415    0.6452       767
        stock     0.0000    0.0000    0.0000        45
        story     0.6667    0.0093    0.0183       215
         tech     0.5118    0.3976    0.4475      1089
       travel     0.3529    0.4343    0.3894       693
        world     0.4549    0.4287    0.4414       905

     accuracy                         0.4861     10000
    macro avg     0.4829    0.4358    0.4335     10000

               precision    recall  f1-score   support

  agriculture     0.4844    0.4393    0.4607       494
          car     0.5547    0.6410    0.5947       791
      culture     0.4074    0.4810    0.4411       736
          edu     0.4507    0.6084    0.5178       646
entertainment     0.5981    0.4187    0.4926       910
      finance     0.4142    0.5628    0.4772       956
         game     0.6409    0.4901    0.5555       659
        house     0.6082    0.5132    0.5567       378
     military     0.4457    0.6131    0.5162       716
       sports     0.6421    0.6362    0.6392       767
        stock     0.0000    0.0000    0.0000        45
        story     0.7143    0.0930    0.1646       215
         tech     0.4916    0.4298    0.4586      1089
       travel     0.3416    0.5195    0.4121       693
        world     0.5531    0.2188    0.3135       905

     accuracy                         0.4880     10000
    macro avg     0.4898    0.4443    0.4400     10000
 weighted avg     0.5099    0.4880    0.4802     10000
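
The two summary rows differ in how they aggregate the per-class scores: `macro avg` is the plain mean over the 15 classes, while `weighted avg` weights each class by its support. The sketch below recomputes both F1 averages from the rounded per-class values of the report directly above, so small rounding differences against the printed figures are possible.

```python
# (f1-score, support) per class, copied from the report above.
rows = {
    'agriculture':   (0.4607, 494),
    'car':           (0.5947, 791),
    'culture':       (0.4411, 736),
    'edu':           (0.5178, 646),
    'entertainment': (0.4926, 910),
    'finance':       (0.4772, 956),
    'game':          (0.5555, 659),
    'house':         (0.5567, 378),
    'military':      (0.5162, 716),
    'sports':        (0.6392, 767),
    'stock':         (0.0000, 45),
    'story':         (0.1646, 215),
    'tech':          (0.4586, 1089),
    'travel':        (0.4121, 693),
    'world':         (0.3135, 905),
}

total_support = sum(support for _, support in rows.values())
# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)
# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(f'macro avg f1    : {macro_f1:.4f}')
print(f'weighted avg f1 : {weighted_f1:.4f}')
```

Because small classes such as `stock` and `story` score poorly, the macro average sits below the weighted one.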


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/Users/brikerman/Desktop/python/Kashgari2/venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 681, in on_epoch
    yield epoch_logs
  File "/Users/brikerman/Desktop/python/Kashgari2/venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/Users/brikerman/Desktop/python/Kashgari2/venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/Users/brikerman/Desktop/python/Kashgari2/venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/Users/brikerman/Desktop/python/Kashgari2/venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)

TypeError: can only concatenate str (not "list") to str