## One-hot representation

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the corpus
corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']

one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus)
# Produce the collapsed one-hot vectors (one binary row per document)
one_hot.todense()

matrix([[1, 1, 0, 1, 0, 1, 1],
        [0, 0, 1, 1, 1, 1, 0]])
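The matrix above has no column labels, so it can help to rebuild it by hand. The following is a minimal sketch in plain Python, assuming the vectorizer's default behaviour of lowercasing the text, keeping only tokens of two or more word characters (which is why "a" gets no column), and ordering the columns alphabetically. The names `tokenize`, `vocabulary` and `one_hot_rows` are illustrative and not part of scikit-learn.

```python
import re

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# One column per distinct token, in alphabetical order.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# Collapsed one-hot: 1 if the token occurs in the document at all, else 0.
one_hot_rows = []
for doc in corpus:
    present = set(tokenize(doc))
    one_hot_rows.append([int(token in present) for token in vocabulary])

print(vocabulary)
for row in one_hot_rows:
    print(row)
```

Because the encoding is binary, the repeated "flies" in the first sentence still contributes a single 1.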

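# TF-IDF representation
The inverse document frequency (IDF) up-weights rare terms: a word such as "tetrafluoroethylene" occurs infrequently, yet it is likely to indicate the nature of a patent document.<br>
$IDF(w) = \log\left(\frac{N}{n_w + 1}\right)$, where $N$ is the number of documents and $n_w$ is the number of documents that contain $w$. <br> $TF(w) = \frac{\text{number of occurrences of } w \text{ in the document}}{\text{total number of words (or characters) in the document}}$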
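The two formulas can be checked on the two-sentence corpus with a few lines of plain Python. This is only a sketch of the definitions as written above: the pre-tokenized `docs` list and the helper names `tf`, `idf` and `tf_idf` are assumptions made for illustration.

```python
import math
from collections import Counter

# The example corpus, already lowercased and split into words.
docs = [["time", "flies", "flies", "like", "an", "arrow"],
        ["fruit", "flies", "like", "a", "banana"]]

def tf(word, doc):
    # Occurrences of the word divided by the total number of words in the document.
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # N = number of documents, n_w = number of documents containing the word.
    n_w = sum(word in doc for doc in docs)
    return math.log(len(docs) / (n_w + 1))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf("flies", docs[0]))    # 2 of 6 words
print(idf("arrow", docs))      # log(2 / 2)
print(idf("flies", docs))      # log(2 / 3)
```

With only two documents the `+ 1` in the denominator drives the IDF to zero for words found in one document and below zero for words found in both, so this form of the formula is better suited to larger corpora.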

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).todense()
tfidf

matrix([[0.42519636, 0.42519636, 0.        , 0.60506143, 0.        ,
         0.30253071, 0.42519636],
        [0.        , 0.        , 0.57615236, 0.40993715, 0.57615236,
         0.40993715, 0.        ]])
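The values above do not follow directly from $\log(N / (n_w + 1))$. With its default settings, `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of $\ln\frac{1 + N}{1 + n_w} + 1$, and then scales each row to unit L2 norm. The sketch below reproduces the matrix under those assumptions in plain Python; the count matrix is typed in by hand and the variable names are illustrative.

```python
import math

vocabulary = ["an", "arrow", "banana", "flies", "fruit", "like", "time"]
# Raw term counts per document ("flies" occurs twice in the first sentence).
counts = [[1, 1, 0, 2, 0, 1, 1],
          [0, 0, 1, 1, 1, 1, 0]]

n_docs = len(counts)
# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
# Smoothed IDF, assumed to match the vectorizer's default settings.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(vec):
    # Scale the vector so that its Euclidean length is 1.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

tfidf_rows = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]
for row in tfidf_rows:
    print([round(v, 8) for v in row])
```

Terms that occur in both sentences ("flies", "like") get the smallest IDF, yet "flies" still has the largest weight in the first row because it appears twice there.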

# Text data augmentation methods



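## 1. Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EDA)
EDA proposes four simple data augmentation operations: synonym replacement, random insertion, random swap, and random deletion.
1. Synonym replacement: randomly choose n words in the sentence that are not stop words and replace each of them with a synonym taken from a synonym dictionary.
2. Random insertion: pick a random word in the sentence that is not a stop word, find one of its synonyms, and insert that synonym at a random position in the sentence. Repeat n times.
3. Random swap: randomly choose two words in the sentence and swap their positions, which changes the word order. Repeat n times.
4. Random deletion: remove each word in the sentence independently with probability p.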
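Two of the four operations, random swap and random deletion, need no synonym dictionary and can be sketched with the standard library alone. The functions below are a simplified illustration and not the code in `eda.py`; the names `random_swap` and `random_deletion`, and the choice to always keep at least one word, are assumptions.

```python
import random

def random_swap(words, n):
    # Swap two randomly chosen positions, n times.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p):
    # Drop each word independently with probability p.
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if random.random() >= p]
    # Assumption: never return an empty sentence, keep one random word instead.
    return kept if kept else [random.choice(words)]

random.seed(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```

Both functions work on a list of tokens and return a new list, so the original sentence is left untouched and several augmented variants can be drawn from it.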

In [4]:
# code in eda.py
from eda import eda


ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."


# Output is a list of augmented sentences
eda(ex_text_str)

['memphis tenn four clarence day ago jon rahm was enduring the seasons worst weather stipulate on sunday at the open on his way to a closing at royal portrush which considering the wind and the rain was a sizeable show up th first round at the wgc fedex st jude invitational was another story with temperatures in the mid s and just any wind the spaniard was strokes better in a unflawed round thanks to his serious putting performance on the pga tour rahm finished with an under for a trio stroke lead which was regular more impressive considering hed never played the front nine at tpc southwind',
 'memphis tenn four days ago jon rahm was enduring closure the seasons worst weather conditions on sunday skilful along at the open on his way to a closing at royal lift atomic number portrush which considering the wind and undetermined the rain was a respectable showing thursdays first round at the wgc fedex st jude invitational was another story with temperatures in the mid s and hardly any wind

## 2. AEDA Augmentation
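AEDA randomly inserts punctuation marks into the original text. The method is only suitable for text classification tasks.
1. The number of punctuation marks to insert is drawn at random between 1 and $1/3$ of the sentence length. This guarantees that every sentence receives at least one mark, which increases its complexity, while avoiding so many marks that they disturb the semantics of the sentence; too much noise may also hurt the model.
2. The marks are drawn from six candidates: ".", ";", "?", ":", "!" and ",".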


In [5]:
# Code: https://github.com/akkarimi/aeda_nlp/blob/master/code/aeda.py

# Insert punctuation marks into a given sentence with the given ratio "punc_ratio"
import random

random.seed(0)

# all punctuations
PUNCTUATIONS = ['.', ',', '!', '?', ';', ':']
PUNC_RATIO = 0.3

def insert_punctuation_marks(sentence, punc_ratio=0.3):
    words = sentence.split(' ') # split string into words
    new_line = []
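    # Number of marks to insert: a random integer in [1, int(punc_ratio * len(words) + 1)]
    q = random.randint(1, int(punc_ratio * len(words) + 1))
    # Word positions that will get a mark inserted in front of them
    qs = random.sample(range(0, len(words)), q)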

    for j, word in enumerate(words):
        if j in qs:
            # Put a randomly chosen punctuation mark in front of the selected word
            new_line.append(PUNCTUATIONS[random.randint(0, len(PUNCTUATIONS) - 1)])
        new_line.append(word)
    # Join the tokens back into a single string, separated by single spaces
    new_line = ' '.join(new_line)
    return new_line


ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."
print(ex_text_str)
insert_punctuation_marks(ex_text_str)

MEMPHIS, Tenn. – Four days ago, Jon Rahm was     enduring the season’s worst weather conditions on Sunday at The     Open on his way to a closing 75 at Royal Portrush, which     considering the wind and the rain was a respectable showing.     Thursday’s first round at the WGC-FedEx St. Jude Invitational     was another story. With temperatures in the mid-80s and hardly any     wind, the Spaniard was 13 strokes better in a flawless round.     Thanks to his best putting performance on the PGA Tour, Rahm     finished with an 8-under 62 for a three-stroke lead, which     was even more impressive considering he’d never played the     front nine at TPC Southwind.


'MEMPHIS, Tenn. – Four days ago, Jon Rahm was  ;    enduring the season’s worst weather . conditions on Sunday at The  !  ?   Open on his way to a closing 75 ! at Royal ; Portrush, which    :  considering the wind and the rain was a respectable showing.   ,   Thursday’s first round ; at the WGC-FedEx St. ? Jude Invitational ?     was another ; story. With temperatures in the ! mid-80s and . hardly any     wind, the ; Spaniard was 13 strokes . better in a flawless round.     Thanks to his . best putting performance on : the ? PGA Tour, Rahm     : finished with an 8-under 62 for a : three-stroke lead, : which     . was ; even more impressive considering he’d ? never played the     front nine at TPC Southwind.'

## 3. NLPAug toolkit
This section mainly uses the character-level augmenters.

In [None]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas



In [3]:
text = 'The quick brown fox jumps over the lazy dog .'
print(text)

The quick brown fox jumps over the lazy dog .


### 1 input and 1 output

In [4]:
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:", text)
print("Augmented Text:", augmented_text[0])

Original: The quick brown fox jumps over the lazy dog .
Augmented Text: The qjisk brown fox juHLs kv4r the lazy dog.
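`KeyboardAug` is meant to simulate typos caused by hitting a nearby key. The idea can be sketched without the library as follows. This is not nlpaug's implementation: the deliberately partial `NEIGHBOURS` map, the `keyboard_noise` function and its `char_prob` parameter are all hypothetical and exist only for illustration.

```python
import random

# Hypothetical, deliberately partial QWERTY neighbour map (illustration only).
NEIGHBOURS = {
    'a': 'qwsz', 'e': 'wrsd', 'i': 'uojk', 'o': 'ipkl',
    'u': 'yihj', 'q': 'wa', 'k': 'jlim', 'r': 'etdf',
}

def keyboard_noise(text, char_prob=0.2, rng=None):
    # Replace each mapped character with a neighbouring key with probability char_prob.
    rng = rng or random.Random()
    out = []
    for ch in text:
        options = NEIGHBOURS.get(ch.lower())
        if options and rng.random() < char_prob:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return ''.join(out)

text = 'The quick brown fox jumps over the lazy dog .'
print(keyboard_noise(text, char_prob=0.3, rng=random.Random(0)))
```

Substituting one character for another keeps the length of the text unchanged, which makes the noisy sentence easy to align with the original when inspecting the result.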


### 1 input and N outputs

In [5]:
aug = nac.KeyboardAug()
# Parameter n=2
augmented_text = aug.augment(text, n=2)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick vrowm fox <umpA over the laxu dog.', 'The WuicO brown fox nuKps oce4 the lazy dog.']


### N inputs and N outputs

In [6]:
texts = [
    'The quick brown fox jumps over the lazy dog .',
    'It is proved that augmentation is one of the anchor to success of computer vision model.'
]

aug = nac.KeyboardAug()
augmented_text = aug.augment(texts)
print("Original:")
print(texts)
print("Augmented Text:")
print(augmented_text)

Original:
['The quick brown fox jumps over the lazy dog .', 'It is proved that augmentation is one of the anchor to success of computer vision model.']
Augmented Text:
['The qIicL frlwn fox numpa over the lazy dog.', 'It is pdofed tga6 aunmeb^atioj is one of the anDhoG to success of computer v7si0n moXe:.']


## References

1. [Summary of text augmentation methods](https://blog.csdn.net/Flying_sfeng/article/details/121691380)
2. [NLPAug documentation](https://github.com/makcedward/nlpaug)
3. [NLPAug usage notebook](https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb)