Data Augmentation for Text Data #279

Open
ertugrul-dmr opened this issue Jun 8, 2021 · 0 comments

Labelled text data is hard to obtain and costly to annotate manually, but the more data we have, the better performance we can usually achieve. While working on text normalization, we can also consider text augmentation.

Adding augmented text data might boost model performance by increasing the number of instances to train on. We can try several approaches, from simple to more complex:

Random Removal:

  • In this method we randomly select a given percentage of the words in a document and delete them.
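
A minimal sketch of random removal, assuming whitespace tokenization is good enough for a first experiment. The function name `random_removal` and its `fraction` and `seed` parameters are illustrative, not part of any existing code in this project.

```python
import random

def random_removal(text, fraction=0.1, seed=None):
    """Delete roughly `fraction` of the whitespace-separated words in `text`."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text  # nothing sensible to remove
    # Remove at least one word, but always keep at least one.
    n_remove = min(len(words) - 1, max(1, round(len(words) * fraction)))
    drop = set(rng.sample(range(len(words)), n_remove))
    return " ".join(w for i, w in enumerate(words) if i not in drop)
```

Passing a fixed `seed` makes the augmentation reproducible, which helps when comparing model runs.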

Synonym Replacement:

  • In this method we augment the data by randomly replacing words with their synonyms.
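
A rough sketch of synonym replacement. The `SYNONYMS` table here is a made-up toy dictionary so the example runs on its own; in practice the synonyms would likely come from a thesaurus resource such as WordNet. The function name and parameters are assumptions for illustration.

```python
import random

# Toy synonym table, invented for illustration only.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def synonym_replacement(text, synonyms=SYNONYMS, n=1, seed=None):
    """Replace up to `n` randomly chosen words that have a known synonym."""
    rng = random.Random(seed)
    words = text.split()
    # Only words present in the synonym table are candidates.
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)
```

Words without an entry in the table are left untouched, so a sentence with no known synonyms comes back unchanged.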

Embedding Replacement:

  • In this method we randomly replace words with their most similar ones, found via k-nearest-neighbor search with cosine similarity.
  • Alternatively, we can use word2vec, GloVe, fastText, etc. to find similar words to replace.
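
The idea can be sketched with a tiny hand-written embedding table. The 2-D vectors in `EMBEDDINGS` are invented so the example is self-contained; real vectors would come from a pretrained model such as word2vec, GloVe, or fastText, and a brute-force neighbor search like this one would probably be too slow for a full vocabulary.

```python
import math
import random

# Toy 2-D vectors, made up for illustration only.
EMBEDDINGS = {
    "cat": (1.0, 0.1),
    "kitten": (0.9, 0.2),
    "dog": (0.7, 0.6),
    "car": (0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbors(word, embeddings, k=2):
    """Return the k words most similar to `word`, most similar first."""
    target = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: cosine(target, embeddings[w]), reverse=True)[:k]

def embedding_replacement(text, embeddings=EMBEDDINGS, prob=0.3, k=2, seed=None):
    """Replace each in-vocabulary word with one of its k neighbors with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if w in embeddings and rng.random() < prob:
            out.append(rng.choice(nearest_neighbors(w, embeddings, k)))
        else:
            out.append(w)
    return " ".join(out)
```

A small `k` keeps replacements close in meaning; a larger `k` adds variety at the cost of more semantic drift.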

Character Replacement:

  • We can replace characters based on common typos on QWERTY keyboards, randomly swapping some characters with neighboring keys.
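
A simple sketch of keyboard-typo noise. As a simplifying assumption, it only treats the left and right keys on the same QWERTY row as neighbors; keys on adjacent rows are ignored. All names here are illustrative.

```python
import random

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def build_neighbors(rows=QWERTY_ROWS):
    """Map each key to its left/right neighbors on the same row (a simplification)."""
    neighbors = {}
    for row in rows:
        for i, ch in enumerate(row):
            neighbors[ch] = [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return neighbors

NEIGHBORS = build_neighbors()

def keyboard_typos(text, prob=0.05, seed=None):
    """Replace each letter with a neighboring key with probability `prob`."""
    rng = random.Random(seed)
    chars = []
    for ch in text:
        options = NEIGHBORS.get(ch.lower())
        if options and rng.random() < prob:
            repl = rng.choice(options)
            # Preserve the case of the original character.
            chars.append(repl.upper() if ch.isupper() else repl)
        else:
            chars.append(ch)
    return "".join(chars)
```

Digits, punctuation, and whitespace are not in the neighbor map, so they pass through unchanged.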

Back Translation:

  • In this method we translate a document into another language using more complex models (like transformers) and then translate it back to the original language. The round trip gives us an augmented version of the document.
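
The round trip can be sketched independently of any particular translation model. In the sketch below, `to_pivot` and `from_pivot` are hypothetical callables that would wrap whatever translation model is chosen; the word-by-word "translators" are toy stand-ins invented only so the example runs, and they are not real translations.

```python
def back_translate(texts, to_pivot, from_pivot, keep_identical=False):
    """Translate each text to a pivot language and back, keeping the round-trip results."""
    augmented = []
    for text in texts:
        round_trip = from_pivot(to_pivot(text))
        # A round trip that returns the same text adds no new training signal.
        if keep_identical or round_trip.strip().lower() != text.strip().lower():
            augmented.append(round_trip)
    return augmented

# Toy lookup tables standing in for real translation models (illustration only).
EN_TO_DE = {"the": "der", "movie": "film", "was": "war", "great": "toll"}
DE_TO_EN = {"der": "the", "film": "film", "war": "was", "toll": "awesome"}

def toy_en_to_de(text):
    return " ".join(EN_TO_DE.get(w, w) for w in text.split())

def toy_de_to_en(text):
    return " ".join(DE_TO_EN.get(w, w) for w in text.split())
```

Keeping the translators as plain callables means the same loop works whether the model runs locally or behind an API.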

Text Generation:

  • We can generate synthetic data using text generators like GPT-2. This might give us more data to train with.
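
One possible shape for the generation loop, sketched without committing to a specific model. Here `generate` is a hypothetical callable that takes a prompt and returns generated text; it could wrap GPT-2 or any other generator. Using the first few words of an existing document as the prompt is an assumption of this sketch, not something the issue prescribes.

```python
def generate_synthetic(seed_texts, generate, n_per_seed=2, prompt_words=5):
    """Build synthetic documents by prompting a generator with the start of each seed text."""
    synthetic = []
    for text in seed_texts:
        # Use the opening words of a real document as the prompt.
        prompt = " ".join(text.split()[:prompt_words])
        for _ in range(n_per_seed):
            candidate = generate(prompt).strip()
            # Skip empty outputs, exact copies of the seed, and duplicates.
            if candidate and candidate != text and candidate not in synthetic:
                synthetic.append(candidate)
    return synthetic
```

Generated text does not come with trustworthy labels, so synthetic examples would probably need a filtering or review step before being added to the training set.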