Data Augmentation for Text Data #279

ertugrul-dmr · 2021-06-08T16:42:27Z

Labelled text data is hard to obtain and costly to label manually, but in general the more data we have, the better performance we can achieve. While working on text normalization we can also consider text augmentation.

Adding augmented text data might boost model performance by increasing the number of instances to train on. For this we can try several approaches, from simple to more complex ones:

Random Removal:

In this method we randomly select a given percentage of the words in a document and delete them.
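A minimal sketch of random removal, assuming whitespace tokenization. The function name `random_removal` and the parameters `p` and `seed` are illustrative choices, not something already defined in this project.

```python
import random

def random_removal(text, p=0.1, seed=None):
    """Delete roughly a fraction p of the words in text, chosen at random.

    Assumption: words are whitespace-separated, so runs of whitespace
    are collapsed to single spaces in the output.
    """
    rng = random.Random(seed)
    words = text.split()
    if len(words) <= 1:
        return text
    # Always keep at least one word so the document never becomes empty.
    n_remove = min(len(words) - 1, round(p * len(words)))
    drop = set(rng.sample(range(len(words)), n_remove))
    return " ".join(w for i, w in enumerate(words) if i not in drop)
```

Passing a fixed `seed` makes the augmentation reproducible, which helps when comparing test results across runs.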

Synonym Replacement:

In this method we can augment the data by randomly replacing words with their synonyms.
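A rough sketch of synonym replacement. The `SYNONYMS` table below is a made-up toy example; in practice the synonyms would come from a real lexical resource (WordNet is one option), and the function name and parameters are assumptions for illustration.

```python
import random

# Toy synonym table, invented for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def synonym_replacement(text, synonyms=SYNONYMS, n=1, seed=None):
    """Replace up to n randomly chosen words that have an entry in synonyms."""
    rng = random.Random(seed)
    words = text.split()
    # Only words present in the table are candidates for replacement.
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)
```

This sketch ignores word sense and casing of the replacement, so a real pipeline would likely want part-of-speech or context checks.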

Embedding Replacement:

In this method we can randomly replace words with their most similar ones in an embedding space, found using k-nearest neighbors and cosine similarity.
Alternatively, we can use word2vec, GloVe, fastText, etc. to get similar words to replace.
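The nearest-neighbor idea can be sketched as follows. The 3-dimensional `EMBEDDINGS` vectors are invented for illustration; real vectors would come from a pretrained model such as the ones named above. All function names and parameters here are hypothetical.

```python
import math
import random

# Toy vectors, made up for illustration; not from any trained model.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.85, 0.15, 0.05],
    "bad": [-0.9, 0.1, 0.0],
    "movie": [0.0, 0.9, 0.3],
    "film": [0.05, 0.85, 0.35],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, embeddings, k=1):
    """Return the k words most similar to word by cosine similarity."""
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def embedding_replacement(text, embeddings=EMBEDDINGS, p=0.5, k=1, seed=None):
    """Replace each in-vocabulary word with probability p by one of its k neighbors."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        neighbors = nearest_neighbors(w, embeddings, k) if w in embeddings else []
        if neighbors and rng.random() < p:
            out.append(rng.choice(neighbors))
        else:
            out.append(w)
    return " ".join(out)
```

The brute-force search is fine for a toy vocabulary; with a full embedding table an indexed nearest-neighbor lookup would be the more practical choice.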

Character Replacement:

We can replace characters based on common typos on QWERTY keyboards, randomly swapping some characters with their nearest keys on the keyboard.
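A small sketch of keyboard-based character replacement. The `QWERTY_NEIGHBORS` map is a simplified, hand-written adjacency table for the letter keys only (an assumption; digits and punctuation are left untouched), and `keyboard_typos` is a hypothetical name.

```python
import random

# Simplified adjacency map for letter keys on a QWERTY layout (illustrative).
QWERTY_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "y": "tuh", "u": "yij", "i": "uok", "o": "ipl", "p": "ol",
    "a": "qsz", "s": "awdx", "d": "sefc", "f": "drgv", "g": "fthb",
    "h": "gyjn", "j": "hukm", "k": "jil", "l": "ko",
    "z": "asx", "x": "zsdc", "c": "xdfv", "v": "cfgb", "b": "vghn",
    "n": "bhjm", "m": "njk",
}

def keyboard_typos(text, p=0.05, seed=None):
    """Replace each letter with probability p by a neighboring key."""
    rng = random.Random(seed)
    chars = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < p:
            new = rng.choice(neighbors)
            # Preserve the case of the original character.
            chars.append(new.upper() if ch.isupper() else new)
        else:
            chars.append(ch)
    return "".join(chars)
```

A low `p` keeps the text readable while still exposing the model to realistic typing noise.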

Back Translation:

In this method we can translate a document into another language using more complex models (like transformers) and then translate it back to the original language. This can give us an augmented version of the document.
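The round trip itself is simple to sketch if the translation models are treated as pluggable functions. Below, `to_pivot` and `from_pivot` are hypothetical callables standing in for real translation models, and the word-level lookup tables are toy stand-ins used only so the example runs without any model.

```python
def back_translate(texts, to_pivot, from_pivot):
    """Return round-trip translations that differ from their originals.

    to_pivot and from_pivot are assumed to be callables mapping str -> str,
    e.g. wrappers around whatever translation models are available.
    """
    augmented = []
    for text in texts:
        round_trip = from_pivot(to_pivot(text))
        # Keep only round trips that actually changed the text.
        if round_trip.strip().lower() != text.strip().lower():
            augmented.append(round_trip)
    return augmented

# Toy word-level "translators", invented for illustration only.
EN_TO_DE = {"the": "das", "big": "grosse", "house": "haus"}
DE_TO_EN = {"das": "the", "grosse": "large", "haus": "house"}

def toy_translate(table):
    """Build a stand-in translator that maps words through a lookup table."""
    return lambda text: " ".join(table.get(w, w) for w in text.split())

examples = back_translate(
    ["the big house", "the house"],
    toy_translate(EN_TO_DE),
    toy_translate(DE_TO_EN),
)
```

Filtering out unchanged round trips avoids adding exact duplicates to the training set.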

Text Generation:

We can generate synthetic data using text generators like GPT-2. This might give us more data to train with.

ertugrul-dmr added the interesting label Jun 8, 2021

ertugrul-dmr self-assigned this Jun 8, 2021

ertugrul-dmr mentioned this issue Jul 6, 2021

Random Removal Augmentation Test Results #282

Open
