
add data augmentation strategy #2805

Merged (5 commits into PaddlePaddle:develop, Jul 28, 2022)

Conversation

lugimzzz
Contributor

@lugimzzz lugimzzz commented Jul 14, 2022

PR types

Others

PR changes

APIs

Description

Adds vocabulary-based word-level data augmentation strategies: substitution (synonym, homonym, random word, local vocabulary, combined vocabularies), deletion (random), insertion (synonym, homonym, local vocabulary, random word, combined vocabularies), and swap (random).

Features:

from paddlenlp.data_augmentation.word import WordSubstitute, WordDelete, WordSwap, WordInsert
s1 = '2021年,我再看深度学习领域,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个代号:Transformer。'
s2 = '绝对准确率计算的是完全预测正确的样本占总样本数的比例,而0-1损失计算的是完全预测错误的样本占总样本的比例。'

# create_n: number of augmented sentences to generate; aug_n: number of words to substitute
# Synonym substitution
aug = WordSubstitute('synonym', create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)
# ['2021年,我再看深度习世界,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个代号:Transformer。', '2021年,我再看吃水学习领域,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个调号:Transformer。']

# A list of inputs is also supported
augmented = aug.augment([s1,s2])
print(augmented)
# [['2021年,我再看深度学习领域,无论自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个商标:Transformer。', '2021年,我再看深度学习领域,听由自然语言处理、音频信号处理、图像处理、引进系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个代号:Transformer。'], ['绝对准确率乘除的是完全预测正确的样本占总样本数的比例,而0-1损失算计的是完全预测错误的样本占总样本的比例。', '绝对准确率计算的是完全预测正确的样本占总样本数的比重,而0-1损失算计的是完全预测错误的样本占总样本的比例。']]

# aug_percent: percentage of words to substitute
aug = WordSubstitute('synonym', create_n=2, aug_percent=0.1)
augmented = aug.augment(s1)
print(augmented)
# ['2021年景,我再看深度读书领域,管自是语言处理、板眼信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个商标:Transformer。', '2021年成,我再看深度学习园地,无论是自然言语处理、点子信号处理、图像处理、举荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个调号:Transformer。']

# Homonym substitution
aug = WordSubstitute('homonym', create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)
# ['2021年,我再看深度学习领域,无论是自然语言处理、音频新好处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个带好:Transformer。', '2021年,我再看深度学习领域,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都坎到attention混得风生水起,只不过更多时候看到的是它的另一个戴皓:Transformer。']


# Random word substitution
aug = WordSubstitute('random', create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)
# ['原和年,我再看深度学习领域,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个代号:任总ansformer。', '责权年,我再看深度学习领域,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个代号:5.39ansformer。']

# Local vocabulary substitution
aug = WordSubstitute('custom', custom_file_path='data', create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)

# Combined vocabulary substitution (random words are not supported with combined vocabularies)
aug = WordSubstitute(['homonym',  'synonym'], custom_file_path='data', create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)
# ['2021年,我再看深度学习领域,无论是本语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个待好:Transformer。', '2021年,我再看深度学习灵玉,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个商标:Transformer。']

# Random deletion
aug = WordDelete(create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)

# Random swap
aug = WordSwap(create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)

# Synonym insertion; like substitution, insertion also supports homonyms, a local vocabulary, random words, and combined vocabularies
aug = WordInsert('synonym', create_n=2, aug_n=2)
augmented = aug.augment(s1)
print(augmented)

@lugimzzz lugimzzz requested a review from wawltor July 14, 2022 12:46
@lugimzzz lugimzzz self-assigned this Jul 14, 2022
# See the License for the specific language governing permissions and
# limitations under the License.

from .base_augment import *
Collaborator:

An overall usage document is missing; one can be written once the work is complete.

Contributor Author:

OK, documentation and a tutorial will be added once development is finished.

("word_homonym.json", "a578c04201a697e738f6a1ad555787d5",
"https://bj.bcebos.com/paddlenlp/data/word_homonym.json")
}
self.stop_words = self._get_data('stop_words')
Collaborator:

These variable definitions and functions should follow one convention: if a variable is internal and not meant to be accessed, prefix it with _ to make it semi-private.

Contributor Author:

Variables not meant for external access now start with a single underscore, e.g. self.DATA -> self._DATA
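The convention can be sketched as follows (class and attribute names here are illustrative, not the PR's exact code):

```python
class BaseAugmentSketch:
    """Illustrates the semi-private naming convention discussed above."""

    def __init__(self):
        # Leading underscore marks the lookup table as internal:
        # callers outside the class should not depend on it.
        self._DATA = {"stop_words": "stop_words.txt"}

    def _get_data(self, mode):
        # Internal helper, likewise underscore-prefixed.
        return self._DATA.get(mode)
```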

'''Calculate number of words for data augmentation'''
if size == 0:
return 0
aug_percent = self.aug_percent or 0.02
Collaborator:

Shouldn't aug_percent just default to 0.02 when the class is initialized?

Contributor Author:

Changed; aug_percent is now initialized to 0.02 by default.
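A minimal sketch of the revised constructor (parameter names follow the snippets quoted in this review; the class name is hypothetical):

```python
class AugmentConfigSketch:
    # aug_percent now defaults to 0.02 at initialization, instead of
    # being substituted lazily with `self.aug_percent or 0.02`.
    def __init__(self, aug_n=None, aug_percent=0.02, aug_min=1, aug_max=10):
        self.aug_n = aug_n
        self.aug_percent = aug_percent
        self.aug_min = aug_min
        self.aug_max = aug_max
```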

aug_percent=None,
aug_min=1,
aug_max=10):
paddle.set_device("cpu")
Collaborator:

I don't quite understand why paddle.set_device('cpu') is called here.

Contributor Author:

Removed.

'''Read data as list '''
fullname = self._load_file(mode)
data = []
with open(fullname, 'r', encoding='utf-8') as f:
Collaborator:

Should we verify that this file exists?

Contributor Author:

Added if os.path.exists(fullname):
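The guard pattern might look like this (the helper name and error handling are illustrative, not the PR's exact code):

```python
import os

def read_word_list(fullname):
    # Check first that the vocabulary file exists, then read it line by
    # line; raising keeps a bad path visible instead of silently
    # producing an empty word list.
    if not os.path.exists(fullname):
        raise FileNotFoundError("Vocabulary file not found: " + fullname)
    data = []
    with open(fullname, "r", encoding="utf-8") as f:
        for line in f:
            data.append(line.strip())
    return data
```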


def _generate_random_index(self, seq_tokens, skip=True):
'''Random sample words for insertion/deletion/swap'''
# skip stopping words
Collaborator:

skip -> Skip

Contributor Author:

Fixed.

aug_n = min(aug_n, len(indexes))
return random.sample(indexes, aug_n)

def augment(self, sequences, num_thread=1):
Collaborator:

This is a public function; its purpose and parameters need to be documented clearly.

Contributor Author:

Added.
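A docstring along these lines would cover it (signature taken from the quoted diff; the body here is a stub, not the PR's implementation):

```python
def augment(sequences, num_thread=1):
    """Generate augmented variants of the input sequence(s).

    Args:
        sequences (str or list of str): Input sentence or sentences.
        num_thread (int): Number of worker threads. Only 1 is currently
            supported; multithreading is marked "TO BE DONE" in this PR.

    Returns:
        list: For a single string, a list of augmented sentences; for a
        list input, one such list per input sentence.
    """
    raise NotImplementedError  # docstring/signature sketch only
```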

for sequence in sequences:
output.append(self._augment(sequence))
return output
# TO BE DONE: Multi Thread
Collaborator:

Multiprocessing does need consideration; multiprocessing on Windows in particular needs extra testing, as there are some pitfalls: https://segmentfault.com/a/1190000013681586
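On Windows, multiprocessing uses the spawn start method, so any pool launch must sit behind an entry-point guard; a minimal sketch (the worker function is a stand-in for the real augmentation):

```python
import multiprocessing as mp

def augment_one(sequence):
    # Stand-in for the per-sentence augmentation work.
    return sequence.upper()

def augment_parallel(sequences, num_proc=2):
    with mp.Pool(num_proc) as pool:
        return pool.map(augment_one, sequences)

if __name__ == "__main__":
    # Without this guard, spawn-based platforms (Windows, and macOS by
    # default) re-import the module in each worker and would re-launch
    # the pool recursively.
    print(augment_parallel(["hello", "world"]))
```

Worker functions must also be defined at module level so they can be pickled for the child processes.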

Contributor Author:

To be done.

# limitations under the License.
import random

from paddlenlp.data_augmentation import BaseAugment
Collaborator:

Inside a library, prefer relative imports, e.g. from ..data_augmentation import BaseAugment

Contributor Author:

Changed to from ..data_augmentation import BaseAugment

return indexes


if __name__ == '__main__':
Collaborator:

A main function is not recommended inside library code.

Contributor Author:

Removed.

seq_tokens = self.tokenizer.cut(sequence)
aug_n = self._get_aug_n(len(seq_tokens))
aug_indexes = self.skip_words(seq_tokens)
aug_n = min(aug_n, len(aug_indexes))
Collaborator:

This needs discussion: if very few tokens remain after removing stop words, should we stop skipping them?

Contributor Author (@lugimzzz, Jul 19, 2022):

Added a policy: the number of augmented words aug_n may not exceed len(aug_indexes) * 0.3, i.e. at most roughly one in four words is augmented:

def _get_aug_n(self, size, size_a=None):
    if size == 0:
        return 0
    aug_n = self.aug_n or int(math.ceil(self.aug_percent * size))
    if self.aug_min and aug_n < self.aug_min:
        aug_n = self.aug_min
    elif self.aug_max and aug_n > self.aug_max:
        aug_n = self.aug_max
    if size_a is not None:
        aug_n = min(aug_n, int(math.floor(size_a*0.3)))
    return aug_n
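Stripped of the class context, the policy behaves like this (a standalone restatement for illustration, not the PR's exact method):

```python
import math

def get_aug_n(size, size_a=None, aug_n=None, aug_percent=0.02,
              aug_min=1, aug_max=10):
    # Standalone restatement of _get_aug_n above: pick a word count from
    # aug_n or aug_percent, clamp to [aug_min, aug_max], then cap at 30%
    # of the augmentable tokens so at most about one word in four is
    # touched.
    if size == 0:
        return 0
    n = aug_n or int(math.ceil(aug_percent * size))
    if aug_min and n < aug_min:
        n = aug_min
    elif aug_max and n > aug_max:
        n = aug_max
    if size_a is not None:
        n = min(n, int(math.floor(size_a * 0.3)))
    return n
```

For a 50-token sentence with 10 augmentable tokens, the default 2% rate gives ceil(0.02 * 50) = 1 word, well under the cap of floor(10 * 0.3) = 3; an explicit aug_n=8 would be clipped to 3.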

fullname = self.custom_file_path
elif source_type in ['delete']:
fullname = self.delete_file_path
with open(fullname, 'r', encoding='utf-8') as f:
Collaborator:

Please check whether the file exists.

Contributor Author (@lugimzzz, Jul 19, 2022):

Added if os.path.exists(fullname):



if __name__ == '__main__':
aug = WordInsert(aug_type='synonym', create_n=2, aug_n=1)
Collaborator:

Remove the main function.

Contributor Author:

Removed.

idxes = random.sample(list(range(len(candidate_tokens))),
aug_n)
aug_tokens = []
for idx in idxes:
Collaborator:

Rather than sampling one item at a time, sample once for the whole batch; repeated random.sample calls could become a performance bottleneck here.

Contributor Author:

The impact of random on speed will be measured once development is complete, and an optimization approach chosen then.
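The batched alternative the reviewer describes could look like this (a sketch; the function and variable names are illustrative):

```python
import random

def sample_batched(n_candidates, create_n, aug_n):
    # Draw all create_n * aug_n indexes in one random.sample call, then
    # split the flat list into create_n groups of aug_n, instead of
    # calling into random once per augmented sentence.
    flat = random.sample(range(n_candidates), create_n * aug_n)
    return [flat[i * aug_n:(i + 1) * aug_n] for i in range(create_n)]
```

One behavioral difference to weigh: a single sample draws without replacement across all create_n sentences, so no index repeats between the generated variants, which may or may not match the intended semantics.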

aug_indexes = random.sample(aug_indexes, aug_n)
for aug_index in aug_indexes:
token = self.vocab.to_tokens(
random.randint(0,
Collaborator:

Same as above.

Contributor Author:

Same as above.

self._reverse_sequence(seq_tokens.copy(), [aug_token]))
return sentences

def _reverse_sequence(self, output_seq_tokens, aug_tokens):
Collaborator:

Why is this function named reverse? It does not seem related to reversal.

Contributor Author:

Renamed to _generate_sequence

aug = WordSwap(create_n=2, aug_n=1)
s1 = '2021年,我再看深度学习领域,无论是自然语言处理、音频信号处理、图像处理、推荐系统,似乎都看到attention混得风生水起,只不过更多时候看到的是它的另一个代号:Transformer。'

augmented = aug.augment(s1)
Collaborator:

Same as above.

Contributor Author:

Removed.

if source_type in ['synonym', 'homonym']:
fullname = self._load_file('word_' + source_type)
elif source_type in ['custom']:
fullname = self.custom_file_path
Collaborator:

Same as above.

Contributor Author:

Added if os.path.exists(fullname):

elif source_type in ['delete']:
fullname = self.delete_file_path
with open(fullname, 'r', encoding='utf-8') as f:
substitue_dict = json.load(f)
Collaborator:

Check whether the file exists.

Contributor Author:

Added if os.path.exists(fullname):

aug_indexes = random.sample(aug_indexes, aug_n)
for aug_index in aug_indexes:
token = self.vocab.to_tokens(
random.randint(0,
Collaborator:

Same as above.

Contributor Author:

Same as above.

import paddle
from paddle.dataset.common import md5file
from paddle.utils.download import get_path_from_url
from paddlenlp.utils.env import DATA_HOME
Collaborator:

These are paddlenlp-internal imports; consider using relative imports here.

Contributor Author:

Changed:

from ..utils.env import DATA_HOME
from ..data import Vocab, JiebaTokenizer

aug_percent=None,
aug_min=1,
aug_max=10):
paddle.set_device("cpu")
Collaborator:

Make sure to restore the device state; the dataloader currently assumes a CPU environment by default.

Contributor Author:

Removed.
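Had the device pin been kept, a save-and-restore wrapper would avoid leaking state to the caller's dataloader; a generic sketch (the get/set functions here mock global device state and stand in for paddle.get_device / paddle.set_device):

```python
_state = {"device": "gpu:0"}  # mock global device state

def get_device():
    return _state["device"]

def set_device(name):
    _state["device"] = name

def run_on_cpu(fn):
    # Pin to CPU for the duration of fn, then restore the previous
    # device even if fn raises.
    prev = get_device()
    set_device("cpu")
    try:
        return fn()
    finally:
        set_device(prev)
```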



if __name__ == '__main__':
aug = WordDelete(create_n=10, aug_n=1)
Collaborator:

Same as Zeyang's comment above; consider moving this into a unit test.

Contributor Author:

Fixed.

Collaborator (@wawltor) left a comment:

LGTM

@wawltor wawltor merged commit 842954c into PaddlePaddle:develop Jul 28, 2022
@lugimzzz lugimzzz deleted the data_augmentation branch July 28, 2022 09:07
@lugimzzz lugimzzz restored the data_augmentation branch July 28, 2022 09:07
@lugimzzz lugimzzz deleted the data_augmentation branch September 19, 2022 04:52