# About Data Augmentation

In [1]:
import torch
import pandas as pd
import transformers

## 1. Preparing the Data

In [6]:
# Project-local helper module; call it through the name that was actually imported
import preprocessing
train_df, val_df, lookups = preprocessing.read_kfold_file("../data", fold=7, n_splits=10)

ModuleNotFoundError: No module named 'utils'

Try working with just a small subset of the data.

In [4]:
test_docs = val_df.loc[[134, 156, 356, 1088, 1234]].reset_index(drop=True).copy()
test_docs

NameError: name 'val_df' is not defined

In [4]:
test_docs = test_docs.filter(["text", "annotation"])

In [5]:
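print("Length of text.split():", len(test_docs.iloc[2]["text"].split()))
print("Length of annotation:", len(test_docs.iloc[2]["annotation"]))
# str.split() splits on all whitespace, so no token can contain a newline
print("Contains a newline (\\n):", "\n" in test_docs.iloc[3]["text"].split())

Length of text.split(): 358
Length of annotation: 358
Contains a newline (\n): False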


## 2. A First Try

In [31]:
import nlpaug.augmenter.char as nac

In [32]:
sample_text = test_docs.iloc[1]['text']
print(f"Original: {len(sample_text)}\n{sample_text}")

aug = nac.KeyboardAug()
augmented_text = aug.augment(sample_text)
print(f"\nKeyboard Augmented: {len(augmented_text)}\n{augmented_text}")

Original: 2788
Not everyone thinks the same way, if you ask someone for their opinion and they suggest one thing, not everyone will agree with what they suggest. However if you ask more than one person for their opinion, you will get a better idea of what people want and like. If you can only buy one shirt but have two shirts you like, you could ask your friends. Assuming you have good friends they will give you their honest opinion. You can than then buy the shirt that most of your friends suggest.

Getting multiple opinions can make you happier. Your friends know you very well and will sometimes know what you want before you even know what you want. A lot of your friends can make big decisions for you because they know what you want. If you want to get a hair cut, but have no idea how you want to get it cut, ask some of your closest friends. They can give you a great idea that will make you happy.

It can also reduce stress in your everyday life. If you don't know what you should buy

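Problems
1. It is hard to see which parts were changed.
2. `'` comes back as ` ' `, with spaces inserted around the apostrophe.

Write a preprocessing step and a function that highlights the changed words.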

In [33]:
# Collapse extra whitespace so the text is easier to read
sample_text = " ".join(sample_text.split())
print(len(sample_text.split()))     # the token count still matches the annotation length

530


In [34]:
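def print_and_highlight_diff(orig_text, new_texts):
    """Print the original text, then each augmented text with changed words in red."""
    orig_split = orig_text.split()
    print(f"Original: {len(orig_split)}\n{orig_text}\n")
    # Undo the spaces the augmenter inserts around apostrophes (don ' t -> don't)
    new_texts = [x.replace(" ' ", "'") for x in new_texts]
    for new_text in new_texts:
        new_split = new_text.split()
        print(f"\nAugmented: {len(new_split)}")
        for i, word in enumerate(new_split):
            if i < len(orig_split) and word == orig_split[i]:
                print(word, end=" ")
            else:
                # ANSI escape codes: print words that differ from the original in red
                print('\033[31m' + word + '\033[0m', end=" ")
        print()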

## 3. Keyboard Aug

In [35]:
# Never substitutes digits or special characters such as ! or ?.
# About 10% of the words are augmented (at most 20 words), one character per word.
aug = nac.KeyboardAug(include_numeric=False, include_special_char=False, aug_char_max=1, aug_word_p=0.1, aug_word_max=20)
augmented_texts = aug.augment(sample_text, n=3)     # n=3 returns three different augmentations
print_and_highlight_diff(sample_text, augmented_texts)

Original: 530
Not everyone thinks the same way, if you ask someone for their opinion and they suggest one thing, not everyone will agree with what they suggest. However if you ask more than one person for their opinion, you will get a better idea of what people want and like. If you can only buy one shirt but have two shirts you like, you could ask your friends. Assuming you have good friends they will give you their honest opinion. You can than then buy the shirt that most of your friends suggest. Getting multiple opinions can make you happier. Your friends know you very well and will sometimes know what you want before you even know what you want. A lot of your friends can make big decisions for you because they know what you want. If you want to get a hair cut, but have no idea how you want to get it cut, ask some of your closest friends. They can give you a great idea that will make you happy. It can also reduce stress in your everyday life. If you don't know what you should buy, o

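With these settings, Keyboard Aug replaces a single character in each selected word with a character from a nearby key, imitating a typing mistake.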
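To make the idea concrete, here is a minimal standard-library sketch of keyboard-neighbor substitution. It is not nlpaug's implementation: the `NEIGHBORS` table is a small hand-written assumption, and `keyboard_typo` and `keyboard_augment` are hypothetical helper names. Because the replacement never contains whitespace, the token count is preserved.

```python
import random

# Partial QWERTY neighbor map, for illustration only (an assumption, not nlpaug's table)
NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy", "r": "edft",
}

def keyboard_typo(word, rng):
    """Replace at most one character of `word` with a neighboring key."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBORS]
    if not positions:
        return word  # no character we know how to perturb
    i = rng.choice(positions)
    repl = rng.choice(NEIGHBORS[word[i].lower()])
    if word[i].isupper():
        repl = repl.upper()  # keep the original casing
    return word[:i] + repl + word[i + 1:]

def keyboard_augment(text, word_p=0.1, seed=0):
    """Perturb roughly a fraction `word_p` of the whitespace-separated words."""
    rng = random.Random(seed)
    words = text.split()
    return " ".join(
        keyboard_typo(w, rng) if rng.random() < word_p else w for w in words
    )

demo = "Not everyone thinks the same way"
augmented = keyboard_augment(demo, word_p=0.5)
print(augmented)
```

Each word keeps its length and position, so a per-token annotation stays aligned without any extra work.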

## 4. Spelling Aug

Spelling Aug replaces words with common misspellings.  
The result looks less unnatural than Keyboard Aug.

In [36]:
import nlpaug.augmenter.word as naw
aug = naw.SpellingAug(aug_max=30, aug_p=0.2)
augmented_texts = aug.augment(sample_text, n=3)     # n=3 returns three different augmentations
print_and_highlight_diff(sample_text, augmented_texts)

Original: 530
Not everyone thinks the same way, if you ask someone for their opinion and they suggest one thing, not everyone will agree with what they suggest. However if you ask more than one person for their opinion, you will get a better idea of what people want and like. If you can only buy one shirt but have two shirts you like, you could ask your friends. Assuming you have good friends they will give you their honest opinion. You can than then buy the shirt that most of your friends suggest. Getting multiple opinions can make you happier. Your friends know you very well and will sometimes know what you want before you even know what you want. A lot of your friends can make big decisions for you because they know what you want. If you want to get a hair cut, but have no idea how you want to get it cut, ask some of your closest friends. They can give you a great idea that will make you happy. It can also reduce stress in your everyday life. If you don't know what you should buy, o

## 5. Synonym Aug

An augmentation that replaces words with similar expressions (synonyms).

In [37]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

aug = naw.SynonymAug()
augmented_texts = aug.augment(sample_text, n=3)
print_and_highlight_diff(sample_text, augmented_texts)

[nltk_data] Downloading package wordnet to /Users/kakeru/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/kakeru/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kakeru/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Original: 530
Not everyone thinks the same way, if you ask someone for their opinion and they suggest one thing, not everyone will agree with what they suggest. However if you ask more than one person for their opinion, you will get a better idea of what people want and like. If you can only buy one shirt but have two shirts you like, you could ask your friends. Assuming you have good friends they will give you their honest opinion. You can than then buy the shirt that most of your friends suggest. Getting multiple opinions can make you happier. Your friends know you very well and will sometimes know what you want before you even know what you want. A lot of your friends can make big decisions for you because they know what you want. If you want to get a hair cut, but have no idea how you want to get it cut, ask some of your closest friends. They can give you a great idea that will make you happy. It can also reduce stress in your everyday life. If you don't know what you should buy, o

The number of tokens changes, so the annotation would have to be shifted to match. That is probably doable, but tedious.
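One possible way to realign the annotation is to diff the original and augmented token lists and carry the labels across. The sketch below uses only `difflib` from the standard library. `realign_labels` is a hypothetical helper, the label names are made up, and the rules for replaced and inserted tokens are assumptions that may not suit every tagging scheme (for example, BIO tags would need extra care).

```python
from difflib import SequenceMatcher

def realign_labels(orig_tokens, labels, new_tokens, default="O"):
    """Carry per-token labels over to an augmented token list."""
    if len(orig_tokens) != len(labels):
        raise ValueError("orig_tokens and labels must have the same length")
    new_labels = []
    matcher = SequenceMatcher(a=orig_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            new_labels.extend(labels[i1:i2])
        elif tag == "replace":
            # Assumption: new tokens inherit the label of the first replaced token
            new_labels.extend([labels[i1]] * (j2 - j1))
        elif tag == "insert":
            # Assumption: inserted tokens take the label of the preceding token
            if i1 > 0:
                fallback = labels[i1 - 1]
            else:
                fallback = labels[0] if labels else default
            new_labels.extend([fallback] * (j2 - j1))
        # "delete": the tokens are gone, so their labels are dropped too
    return new_labels

orig_tokens = "They can give you a great idea".split()
labels = ["Lead", "Lead", "Lead", "Lead", "Claim", "Claim", "Evidence"]
new_tokens = "They can give you a really good idea".split()

new_labels = realign_labels(orig_tokens, labels, new_tokens)
print(list(zip(new_tokens, new_labels)))
```

Here "great" becomes "really good", and both new tokens receive the label that "great" had, so the label list again has one entry per token.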

## 6. WordEmbsAug

Replacement words are drawn from a suitable pretrained embedding (GloVe, Google News, etc.).

In [38]:
import nlpaug.augmenter.word.word_embs as nawwe

# This augmenter loads GloVe vectors, so it is named accordingly
glove_aug = nawwe.WordEmbsAug(model_type='glove', model_path="../models/augmentation/glove.6B.300d.txt", top_k=30, aug_max=25, aug_p=0.1)

In [39]:
augmented_texts = glove_aug.augment(sample_text, n=5)
print_and_highlight_diff(sample_text, augmented_texts)

Original: 530
Not everyone thinks the same way, if you ask someone for their opinion and they suggest one thing, not everyone will agree with what they suggest. However if you ask more than one person for their opinion, you will get a better idea of what people want and like. If you can only buy one shirt but have two shirts you like, you could ask your friends. Assuming you have good friends they will give you their honest opinion. You can than then buy the shirt that most of your friends suggest. Getting multiple opinions can make you happier. Your friends know you very well and will sometimes know what you want before you even know what you want. A lot of your friends can make big decisions for you because they know what you want. If you want to get a hair cut, but have no idea how you want to get it cut, ask some of your closest friends. They can give you a great idea that will make you happy. It can also reduce stress in your everyday life. If you don't know what you should buy, o

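This one is also a hassle: the token count changes, and some of the embedding neighbors are odd.

GloVe needs a small code fix (the handling around `'` seems off?). Retrying inside a try/except whenever the length differs should probably work.  
Google News contains entries that look like URLs, which is a hassle.  
FastText looks good, but it produces tokens such as it.A and and.there, which are hard to handle.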
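The retry idea mentioned for GloVe could be sketched as follows. `augment_keeping_length` is a hypothetical helper, and `augment_fn` stands for any callable that maps a string to a string (for example, a thin wrapper around an augmenter's `augment` call). The stand-in `fake_augment` below exists only so the sketch runs without nlpaug.

```python
def augment_keeping_length(augment_fn, text, max_tries=5):
    """Call `augment_fn(text)` until the token count matches the input.

    Returns the first length-preserving result, or the original text if
    every attempt fails or raises.
    """
    target = len(text.split())
    for _ in range(max_tries):
        try:
            candidate = augment_fn(text)
        except Exception:
            continue  # the augmenter itself failed; try again
        if len(candidate.split()) == target:
            return candidate
    return text  # give up and keep the unaugmented text

# Stand-in augmenter: the first output splits the apostrophe into its own
# token, the second output keeps the token count intact.
_outputs = iter(["you do n't know", "you don't knpw"])

def fake_augment(text):
    return next(_outputs)

result = augment_keeping_length(fake_augment, "you don't know")
print(result)
```

Falling back to the original text keeps the annotation valid even when no attempt succeeds, at the cost of occasionally returning an unaugmented sample.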

GloVe is my top pick.

## 7. ContextualWordEmbsAug

A method that uses a pretrained language model such as BERT or XLNet to choose replacement words that fit the surrounding context.

In [40]:
import nlpaug.augmenter.word.context_word_embs as nawcwe

aug = nawcwe.ContextualWordEmbsAug(model_path='../models/bert-base-uncased')

In [41]:
augmented_texts = aug.augment(sample_text, n=3)
print_and_highlight_diff(sample_text, augmented_texts)

Original: 530
Not everyone thinks the same way, if you ask someone for their opinion and they suggest one thing, not everyone will agree with what they suggest. However if you ask more than one person for their opinion, you will get a better idea of what people want and like. If you can only buy one shirt but have two shirts you like, you could ask your friends. Assuming you have good friends they will give you their honest opinion. You can than then buy the shirt that most of your friends suggest. Getting multiple opinions can make you happier. Your friends know you very well and will sometimes know what you want before you even know what you want. A lot of your friends can make big decisions for you because they know what you want. If you want to get a hair cut, but have no idea how you want to get it cut, ask some of your closest friends. They can give you a great idea that will make you happy. It can also reduce stress in your everyday life. If you don't know what you should buy, o

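It is slow, and tokenization alone takes about 10 s. This is reportedly the best method, but it is a bit painful.  
The output also comes out strange because of how the tokenizer splits the text.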

## 8. TfIdfAug

Apparently this performs augmentation based on TF-IDF topic similarity.

In [42]:
# A TF-IDF model has to be trained first, which is very tedious (for each fold it must be fit on the training split only)
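The per-fold constraint can be illustrated without nlpaug. The sketch below computes smoothed IDF weights in plain Python, fitting one model per fold on the documents outside that fold. `fit_idf`, `idf_per_fold`, and the `fold_ids` list are hypothetical; how folds are actually assigned in this project is not shown in the notebook, and this is not the model format that TfIdfAug expects.

```python
import math
from collections import Counter

def fit_idf(train_texts):
    """Compute smoothed IDF weights from training documents only."""
    n_docs = len(train_texts)
    doc_freq = Counter()
    for text in train_texts:
        # Count each word once per document
        doc_freq.update(set(text.lower().split()))
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1.0
        for word, df in doc_freq.items()
    }

def idf_per_fold(texts, fold_ids):
    """Return {fold: idf_dict}, each fit with that fold held out."""
    models = {}
    for fold in sorted(set(fold_ids)):
        train = [t for t, f in zip(texts, fold_ids) if f != fold]
        models[fold] = fit_idf(train)
    return models

texts = ["ask your friends", "friends give advice", "buy the shirt"]
fold_ids = [0, 1, 2]  # assumption: one validation fold id per document
models = idf_per_fold(texts, fold_ids)
print(sorted(models[2]))
```

The model for fold 2 never sees "buy the shirt", so no statistics from the validation documents leak into the weights used to augment the training data.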

## 9. Summary

I want to go with (WordEmbsAug -> TfIdfAug -> SpellingAug) or (TfIdfAug -> WordEmbsAug -> SpellingAug).  
Once the NER tags can be handled, SpellingAug should be swapped for SynonymAug.
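Chaining the stages in a fixed order can be sketched as a small composition helper. `chain`, `embed_swap`, and `misspell` are hypothetical: the two stand-in functions only mimic what WordEmbsAug and SpellingAug would do to a single word, and the token-count check is an assumption added because the annotation has one label per token.

```python
def chain(*augment_fns):
    """Build a pipeline that applies each augmenter in order."""
    def run(text):
        n_tokens = len(text.split())
        for fn in augment_fns:
            text = fn(text)
            # Stop early if a stage would break the token/label alignment
            if len(text.split()) != n_tokens:
                raise ValueError(f"{fn.__name__} changed the token count")
        return text
    return run

def embed_swap(text):
    """Stand-in for an embedding-based word replacement."""
    return " ".join("large" if w == "big" else w for w in text.split())

def misspell(text):
    """Stand-in for a spelling-error replacement."""
    return " ".join("becuase" if w == "because" else w for w in text.split())

pipeline = chain(embed_swap, misspell)
text = "friends can make big decisions because they know you"
out = pipeline(text)
print(out)
```

Swapping the argument order in `chain(...)` is all it takes to compare the two candidate orderings.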

It might be easier to write my own code so that the NER tags can be updated along with the text.