Labelled text data is hard to come by, and labelling data manually is costly; but usually, the more data we have, the better performance we can achieve. While working on text normalization we can also consider text augmentation.
Adding augmented text data might boost our model's performance by increasing the number of instances to train on. For this we can try several approaches, from simple to more complex ones:
Random Removal:
In this method we randomly select a given percentage of the words in a document and delete them.
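A minimal sketch of random removal; the function name and the `p` parameter are my own choices, not from any particular library:

```python
import random

def random_removal(text, p=0.1, seed=None):
    """Delete roughly a fraction `p` of the words in `text` at random."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty document: keep at least one word.
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)
```

Passing a `seed` makes the augmentation reproducible, which is useful when regenerating a training set.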
Synonym Replacement:
In this method we augment the data by randomly replacing words with their synonyms.
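A sketch of synonym replacement using a toy hand-made synonym table for illustration; in practice you would pull synonyms from a resource such as WordNet (via `nltk.corpus.wordnet`):

```python
import random

# Toy synonym table for illustration only; a real setup would query
# WordNet or a similar thesaurus resource.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}

def synonym_replacement(text, p=0.3, seed=None):
    """Replace each word that has known synonyms with probability `p`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)
```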
Embedding Replacement:
In this method we can randomly replace words with their most similar ones, found via k-nearest-neighbour search with cosine similarity.
Alternatively, we can use pretrained embeddings such as word2vec, GloVe, or fastText to find similar words as replacements.
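A self-contained sketch of the idea with a few toy word vectors; real usage would load pretrained word2vec, GloVe, or fastText vectors instead:

```python
import math
import random

# Toy word vectors for illustration; replace with pretrained embeddings.
VECTORS = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
    "pear":  [0.12, 0.22, 0.88],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbor(word):
    """Return the most similar other word by cosine similarity (1-NN)."""
    if word not in VECTORS:
        return word
    others = [w for w in VECTORS if w != word]
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))

def embedding_replacement(text, p=0.3, seed=None):
    """Swap each in-vocabulary word for its nearest neighbour with probability `p`."""
    rng = random.Random(seed)
    return " ".join(
        nearest_neighbor(w) if rng.random() < p else w
        for w in text.split()
    )
```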
Character Replacement:
We can replace characters based on common typos on QWERTY keyboards, randomly swapping some characters with those on neighbouring keys.
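A sketch of keyboard-typo augmentation; the adjacency map below covers only a handful of keys and would be extended to the full layout in practice:

```python
import random

# Partial QWERTY adjacency map for illustration; extend to the full
# keyboard layout for real use.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "i": "ujko", "n": "bhjm", "t": "rfgy",
}

def typo_augment(text, p=0.1, seed=None):
    """Swap each character for a neighbouring key with probability `p`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < p:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)
```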
Back Translation:
In this method we translate the document into another language using a more capable model (such as a transformer) and then translate it back to the original language. The result is an augmented, paraphrased version of the document.
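The pipeline itself is simple; the sketch below stubs out `translate` with a tiny lookup table, since a real implementation would call a translation model (for example MarianMT from Hugging Face Transformers) or a translation API:

```python
# `translate` is a placeholder standing in for a real translation model
# or API call; the lookup table exists only to make the sketch runnable.
def translate(text, src, tgt):
    fake = {
        ("en", "de", "the cat sat"): "die katze sass",
        ("de", "en", "die katze sass"): "the cat sat down",
    }
    return fake.get((src, tgt, text), text)

def back_translate(text, pivot="de"):
    """Translate to a pivot language and back to obtain a paraphrase."""
    forward = translate(text, "en", pivot)
    return translate(forward, pivot, "en")
```

The round trip often yields a slightly different but meaning-preserving sentence, which is exactly what makes it useful as augmentation.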
Text Generation:
We can generate synthetic data using text generators like GPT-2. This might give us more data to train with.
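To keep the sketch self-contained, the generator below is a toy bigram (Markov) model standing in for a real model; with GPT-2 you would instead use something like `transformers.pipeline("text-generation", model="gpt2")`:

```python
import random
from collections import defaultdict

# Toy bigram language model as a stand-in for a neural generator
# such as GPT-2.
def train_bigram(corpus):
    """Map each word to the list of words observed to follow it."""
    model = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            model[a].append(b)
    return model

def generate(model, start, max_len=10, seed=None):
    """Sample a synthetic sentence by walking the bigram chain."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_len and out[-1] in model:
        out.append(rng.choice(model[out[-1]]))
    return " ".join(out)
```

Sentences sampled this way recombine fragments of the training corpus, giving new (if noisier) instances to train on.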