add JiebaTokenizer demo #4747

Merged 1 commit on Jul 10, 2020
8 changes: 6 additions & 2 deletions PaddleNLP/similarity_net/README.md
@@ -12,7 +12,8 @@
| Model | Baidu Zhidao | ECOM | QQSIM | UNICOM |
|:-----------:|:-------------:|:-------------:|:-------------:|:-------------:|
| | AUC | AUC | AUC | Positive/negative order ratio |
|BOW_Pairwise|0.6767|0.7329|0.7650|1.5630|
|BOW_Pairwise(WordSeg)|0.6767|0.7329|0.7650|1.5630|
|BOW_Pairwise(Jieba)|0.6658|0.7351|0.8431|1.5331|
#### Test set description
| Dataset | Source | Vertical |
|:-----------:|:-------------:|:-------------:|
@@ -51,7 +52,10 @@ python download.py model
```

#### Evaluation
We have released our self-built test sets, covering four datasets: Baidu Zhidao, ECOM, QQSIM, and UNICOM. Based on the pretrained model above, you can enter the evaluate directory and run the following commands in turn to obtain evaluation results on the test sets.

The evaluation scripts below use Jieba word segmentation as an example. If you need a custom segmentation module, implement your own tokenizer class in [`tokenization.py`](tokenization.py), using `JiebaTokenizer` as a reference, and set the environment variable `TOKENIZER=${YOUR_TOKENIZER_NAME}` in the `evaluate_*.sh` scripts. If the `TOKENIZER` environment variable is empty, the input data is assumed to be pre-tokenized (the bundled sample data was segmented with Baidu's WordSeg tool); a minimal sketch of a custom tokenizer follows.
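A custom tokenizer only needs a `tokenize(text)` method that returns a space-joined token string, mirroring `JiebaTokenizer`; `reader.py` later calls `.split(" ")` on the result. The class below is a hypothetical illustration (the name `CharTokenizer` and its per-character splitting are not part of this PR):

```python
# Hypothetical user-defined tokenizer for tokenization.py.
class CharTokenizer(object):
    """Splits text into single characters (illustrative only)."""

    def tokenize(self, text):
        # Space-join characters so the downstream vocabulary lookup,
        # which does query.split(" "), works unchanged.
        return " ".join(list(text))
```

With `TOKENIZER="CharTokenizer"` set in an `evaluate_*.sh` script, `reader.py` instantiates it via `getattr(tokenization, args.tokenizer)()`.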

```shell
sh evaluate_ecom.sh
sh evaluate_qqsim.sh
4 changes: 2 additions & 2 deletions PaddleNLP/similarity_net/download.py
@@ -96,8 +96,8 @@ def download(url, filename, md5sum):

def download_dataset(dir_path):
BASE_URL = "https://baidu-nlp.bj.bcebos.com/"
DATASET_NAME = "simnet_dataset-1.0.0.tar.gz"
DATASET_MD5 = "ec65b313bc237150ef536a8d26f3c73b"
DATASET_NAME = "simnet_dataset-1.0.1.tar.gz"
DATASET_MD5 = "4a381770178721b539e7cf0f91a8777d"
file_path = os.path.join(dir_path, DATASET_NAME)
url = BASE_URL + DATASET_NAME

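For reference, the downloaded archive can be verified against the MD5 above before use; a minimal standalone sketch (the helper name `md5sum_of` is ours, not the repo's):

```python
import hashlib

def md5sum_of(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected for simnet_dataset-1.0.1.tar.gz: 4a381770178721b539e7cf0f91a8777d
```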
7 changes: 3 additions & 4 deletions PaddleNLP/similarity_net/download_data.sh
@@ -1,5 +1,4 @@
#get data
wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.0.tar.gz
tar xzf simnet_dataset-1.0.0.tar.gz
rm simnet_dataset-1.0.0.tar.gz

wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.1.tar.gz
tar xzf simnet_dataset-1.0.1.tar.gz
rm simnet_dataset-1.0.1.tar.gz
11 changes: 10 additions & 1 deletion PaddleNLP/similarity_net/evaluate/evaluate_ecom.sh
@@ -4,13 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=3
export FLAGS_fraction_of_gpu_memory_to_use=0.95
TASK_NAME='simnet'
TEST_DATA_PATH=./data/ecom
VOCAB_PATH=./data/term2id.dict
CKPT_PATH=./model_files
TEST_RESULT_PATH=./evaluate/ecom_test_result
TASK_MODE='pairwise'
CONFIG_PATH=./config/bow_pairwise.json
INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/

# use JiebaTokenizer to evaluate
TOKENIZER="JiebaTokenizer"
TEST_DATA_PATH=./data/ecom_raw

# use tokenized data by WordSeg to evaluate
#TOKENIZER=""
#TEST_DATA_PATH=./data/ecom

cd ..

python ./run_classifier.py \
@@ -23,5 +31,6 @@ python ./run_classifier.py \
--test_result_path ${TEST_RESULT_PATH} \
--config_path ${CONFIG_PATH} \
--vocab_path ${VOCAB_PATH} \
--tokenizer ${TOKENIZER:-""} \
--task_mode ${TASK_MODE} \
--init_checkpoint ${INIT_CHECKPOINT}
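Note the `${TOKENIZER:-""}` expansion: when `TOKENIZER` is unset or empty, the script still passes an empty `--tokenizer` argument, which `reader.py` treats as "input is already tokenized". A rough Python analogue of that fallback, for clarity:

```python
import os

# "" means: treat the input as pre-tokenized (e.g. WordSeg sample data).
tokenizer_name = os.environ.get("TOKENIZER", "")
```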
11 changes: 10 additions & 1 deletion PaddleNLP/similarity_net/evaluate/evaluate_qqsim.sh
@@ -4,13 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=3
export FLAGS_fraction_of_gpu_memory_to_use=0.95
TASK_NAME='simnet'
TEST_DATA_PATH=./data/qqsim
VOCAB_PATH=./data/term2id.dict
CKPT_PATH=./model_files
TEST_RESULT_PATH=./evaluate/qqsim_test_result
TASK_MODE='pairwise'
CONFIG_PATH=./config/bow_pairwise.json
INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/

# use JiebaTokenizer to evaluate
TOKENIZER="JiebaTokenizer"
TEST_DATA_PATH=./data/qqsim_raw

# use tokenized data by WordSeg to evaluate
#TOKENIZER=""
#TEST_DATA_PATH=./data/qqsim

cd ..

python ./run_classifier.py \
@@ -23,5 +31,6 @@ python ./run_classifier.py \
--test_result_path ${TEST_RESULT_PATH} \
--config_path ${CONFIG_PATH} \
--vocab_path ${VOCAB_PATH} \
--tokenizer ${TOKENIZER:-""} \
--task_mode ${TASK_MODE} \
--init_checkpoint ${INIT_CHECKPOINT}
11 changes: 9 additions & 2 deletions PaddleNLP/similarity_net/evaluate/evaluate_unicom.sh
@@ -4,14 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=3
export FLAGS_fraction_of_gpu_memory_to_use=0.95
TASK_NAME='simnet'
INFER_DATA_PATH=./evaluate/unicom_infer
VOCAB_PATH=./data/term2id.dict
CKPT_PATH=./model_files
INFER_RESULT_PATH=./evaluate/unicom_infer_result
TASK_MODE='pairwise'
CONFIG_PATH=./config/bow_pairwise.json
INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/

# use JiebaTokenizer to evaluate
TOKENIZER="JiebaTokenizer"
INFER_DATA_PATH=./data/unicom_infer_raw

# use tokenized data by WordSeg to evaluate
#TOKENIZER=""
#INFER_DATA_PATH=./evaluate/unicom_infer

python unicom_split.py
cd ..
python ./run_classifier.py \
@@ -23,8 +30,8 @@ python ./run_classifier.py \
--infer_result_path ${INFER_RESULT_PATH} \
--config_path ${CONFIG_PATH} \
--vocab_path ${VOCAB_PATH} \
--tokenizer ${TOKENIZER:-""} \
--task_mode ${TASK_MODE} \
--init_checkpoint ${INIT_CHECKPOINT}
cd evaluate
python unicom_compute_pos_neg.py

11 changes: 10 additions & 1 deletion PaddleNLP/similarity_net/evaluate/evaluate_zhidao.sh
@@ -4,13 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
export CUDA_VISIBLE_DEVICES=3
export FLAGS_fraction_of_gpu_memory_to_use=0.95
TASK_NAME='simnet'
TEST_DATA_PATH=./data/zhidao
VOCAB_PATH=./data/term2id.dict
CKPT_PATH=./model_files
TEST_RESULT_PATH=./evaluate/zhidao_test_result
TASK_MODE='pairwise'
CONFIG_PATH=./config/bow_pairwise.json
INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/

# use JiebaTokenizer to evaluate
TOKENIZER="JiebaTokenizer"
TEST_DATA_PATH=./data/zhidao_raw

# use tokenized data by WordSeg to evaluate
#TOKENIZER=""
#TEST_DATA_PATH=./data/zhidao

cd ..

python ./run_classifier.py \
@@ -23,5 +31,6 @@ python ./run_classifier.py \
--test_result_path ${TEST_RESULT_PATH} \
--config_path ${CONFIG_PATH} \
--vocab_path ${VOCAB_PATH} \
--tokenizer ${TOKENIZER:-""} \
--task_mode ${TASK_MODE} \
--init_checkpoint ${INIT_CHECKPOINT}
45 changes: 45 additions & 0 deletions PaddleNLP/similarity_net/reader.py
@@ -19,6 +19,7 @@
import numpy as np
import io

import tokenization

class SimNetProcessor(object):
def __init__(self, args, vocab):
@@ -27,6 +28,10 @@ def __init__(self, args, vocab):
self.vocab = vocab
self.valid_label = np.array([])
self.test_label = np.array([])
if args.tokenizer:
self.tokenizer = getattr(tokenization, args.tokenizer)()
else:
self.tokenizer = None

def get_reader(self, mode, epoch=0):
"""
@@ -48,6 +53,12 @@ def reader_with_pairwise():
logging.warning(
"line not match format in test file")
continue

# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
title = self.tokenizer.tokenize(title)

query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
@@ -71,6 +82,12 @@ def reader_with_pairwise():
logging.warning(
"line not match format in test file")
continue

# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
title = self.tokenizer.tokenize(title)

query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
@@ -95,6 +112,12 @@ def reader_with_pairwise():
logging.warning(
"line not match format in test file")
continue
# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
pos_title = self.tokenizer.tokenize(pos_title)
neg_title = self.tokenizer.tokenize(neg_title)

query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
@@ -130,6 +153,12 @@ def reader_with_pointwise():
logging.warning(
"line not match format in test file")
continue

# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
title = self.tokenizer.tokenize(title)

query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
@@ -153,6 +182,12 @@ def reader_with_pointwise():
logging.warning(
"line not match format in test file")
continue

# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
title = self.tokenizer.tokenize(title)

query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
@@ -178,6 +213,12 @@ def reader_with_pointwise():
logging.warning(
"line not match format in test file")
continue

# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
title = self.tokenizer.tokenize(title)

query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
@@ -208,6 +249,10 @@ def get_infer_reader(self):
if len(query) == 0 or len(title) == 0:
logging.warning("line not match format in test file")
continue
# tokenize
if self.tokenizer:
query = self.tokenizer.tokenize(query)
title = self.tokenizer.tokenize(title)
query = [
self.vocab[word] for word in query.split(" ")
if word in self.vocab
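The `getattr` dispatch in `__init__` above is what connects the `--tokenizer` flag to a class in `tokenization.py`. A self-contained sketch of the same pattern (the helper `build_tokenizer` is illustrative, not part of the PR):

```python
import tokenization  # the module added in this PR

def build_tokenizer(name):
    """Return a tokenizer instance for `name`, or None for pre-tokenized input."""
    if not name:
        return None
    # An unknown name raises AttributeError, surfacing a misspelled
    # TOKENIZER value early instead of silently skipping tokenization.
    return getattr(tokenization, name)()
```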
33 changes: 33 additions & 0 deletions PaddleNLP/similarity_net/tokenization.py
@@ -0,0 +1,33 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tokenization classes."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import jieba

class JiebaTokenizer(object):
"""Runs end-to-end tokenziation."""

def __init__(self):
# Todo:
pass

def tokenize(self, text):
split_tokens = jieba.cut(text)
split_tokens = " ".join([word for word in split_tokens])
return split_tokens
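A quick usage check, continuing from the class above; the exact segmentation depends on the installed jieba version and its dictionary:

```python
tokenizer = JiebaTokenizer()
print(tokenizer.tokenize("今天天气很好"))  # e.g. "今天天气 很 好"
```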
1 change: 1 addition & 0 deletions PaddleNLP/similarity_net/utils.py
@@ -214,6 +214,7 @@ def __init__(self):
data_g.add_arg("infer_data_dir", str, None,
"Directory path to infer data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("tokenizer", str, None, "Whether or not use user defined tokenizer")
data_g.add_arg("batch_size", int, 32,
"Total examples' number in batch for training.")
