# Text Augmentation

\
Data augmentation techniques are used to generate additional, synthetic data using the data you have.

\
Image augmentation has become a standard procedure in computer vision applications, but text augmentation is relatively new to the natural language processing (NLP) field.

\
The [nlpaug](https://github.com/makcedward/nlpaug) module implements a number of high-performance text augmentation algorithms that may boost performance of NLP models.

\
In Part 1 of the tutorial, we introduced some cool text augmentation functions in the nlpaug module.

\
In Part 2 of the tutorial, we use the nlpaug module to augment Twitter data and compare the performance of bag-of-words models trained with and without text augmentation.

\
References and recommended readings:
* [nlpaug GitHub site](https://github.com/makcedward/nlpaug)
* [Data Augmentation in NLP: Introduction to Text Augmentation](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28)

## Install and import modules

In [None]:
# Install the most recent version of gensim.
# Otherwise, you may get the following error when running naw.WordEmbsAug():
# 'Word2VecKeyedVectors' object has no attribute 'index_to_key'
# see: https://stackoverflow.com/questions/71032760/word2veckeyedvectors-object-has-no-attribute-index-to-key
!pip install --upgrade gensim --quiet


In [None]:
# Import gensim.
# Note: You will need to restart the runtime in order to import the most recent version of gensim
import gensim
print(gensim.__version__)

4.2.0


In [None]:
# Install the transformers module in order to use their base models (e.g., BERT)
!pip install transformers --quiet


In [None]:
# Import transformers
import transformers


In [None]:
# Install the tokenizer needed by the back translation model
!pip install sacremoses --quiet

[?25l[K     |▍                               | 10 kB 37.8 MB/s eta 0:00:01[K     |▊                               | 20 kB 21.8 MB/s eta 0:00:01[K     |█▏                              | 30 kB 28.5 MB/s eta 0:00:01[K     |█▌                              | 40 kB 15.9 MB/s eta 0:00:01[K     |█▉                              | 51 kB 14.1 MB/s eta 0:00:01[K     |██▎                             | 61 kB 16.4 MB/s eta 0:00:01[K     |██▋                             | 71 kB 16.1 MB/s eta 0:00:01[K     |███                             | 81 kB 15.8 MB/s eta 0:00:01[K     |███▍                            | 92 kB 17.4 MB/s eta 0:00:01[K     |███▊                            | 102 kB 15.4 MB/s eta 0:00:01[K     |████                            | 112 kB 15.4 MB/s eta 0:00:01[K     |████▌                           | 122 kB 15.4 MB/s eta 0:00:01[K     |████▉                           | 133 kB 15.4 MB/s eta 0:00:01[K     |█████▏                          | 143 kB 15.4 MB/s eta 0:

In [None]:
# Import the tokenizer
import sacremoses

In [None]:
# Install the nlpaug module
!pip install nlpaug --quiet

[?25l[K     |▉                               | 10 kB 38.0 MB/s eta 0:00:01[K     |█▋                              | 20 kB 20.1 MB/s eta 0:00:01[K     |██▍                             | 30 kB 26.8 MB/s eta 0:00:01[K     |███▏                            | 40 kB 13.9 MB/s eta 0:00:01[K     |████                            | 51 kB 13.6 MB/s eta 0:00:01[K     |████▉                           | 61 kB 15.9 MB/s eta 0:00:01[K     |█████▋                          | 71 kB 15.0 MB/s eta 0:00:01[K     |██████▍                         | 81 kB 14.2 MB/s eta 0:00:01[K     |███████▏                        | 92 kB 15.7 MB/s eta 0:00:01[K     |████████                        | 102 kB 14.3 MB/s eta 0:00:01[K     |████████▉                       | 112 kB 14.3 MB/s eta 0:00:01[K     |█████████▋                      | 122 kB 14.3 MB/s eta 0:00:01[K     |██████████▍                     | 133 kB 14.3 MB/s eta 0:00:01[K     |███████████▏                    | 143 kB 14.3 MB/s eta 0:

In [None]:
# Import the nlpaug module and its methods
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action

In [None]:
# Import other modules
import nltk 
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
# Show all outputs of a cell in a jupyter notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# Mount Google drive to colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Twitter Climate Change Sentiment Dataset

\
[The dataset can be downloaded from Kaggle.](https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset)

\
The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo.

\
This dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43943 tweets were annotated. Each tweet is labelled independently by 3 reviewers. This dataset only contains tweets that all 3 reviewers agreed on (the rest were discarded).

\
Each tweet is labelled as one of the following classes:
* 2 (News): the tweet links to factual news about climate change
* 1 (Pro): the tweet supports the belief of man-made climate change
* 0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
* -1 (Anti): the tweet rejects the belief in man-made climate change

In [None]:
# Read in the dataset using Pandas
path = '/content/drive/MyDrive/Deep Learning Course/Datasets'
df = pd.read_csv(os.path.join(path, "twitter_sentiment_data.csv"))

# Rename column names and remove the tweetid column
df = df.rename(columns = {"sentiment": "label", "message": "text"}).drop('tweetid', axis = 1)

In [None]:
# Recode the labels from -1, 0, 1, 2 to 0, 1, 2, 3
# 0 - anti; 1 - neutral; 2 - pro; 3 - news
df['label'] = df['label'].replace([-1, 0, 1, 2],[0, 1, 2, 3])

In [None]:
# Important note: Check the DataFrame for missing values, which would cause
# errors during preprocessing and training. Here, we simply drop any rows with missing values.
df = df.dropna()

In [None]:
# Take a look at the first five samples in the dataframe
pd.set_option('display.max_colwidth', None)
df.head()

Unnamed: 0,label,text
0,0,@tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
1,2,"RT @NatGeoChannel: Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change https://t.co/LkDehj3tNn httÃ¢â‚¬Â¦"
2,2,Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. https://t.co/7rV6BrmxjW via @youtube
3,2,RT @Mick_Fanning: Just watched this amazing documentary by leonardodicaprio on climate change. We all think thisÃ¢â‚¬Â¦ https://t.co/kNSTE8K8im
4,3,"RT @cnalive: Pranita Biswasi, a Lutheran from Odisha, gives testimony on effects of climate change &amp; natural disasters on the poÃ¢â‚¬Â¦"


## Tokenization

\
Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.

\
The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).
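
To make the idea concrete, here is a minimal sketch that uses only Python's `re` module. The pattern and the `simple_tokenize` name are illustrative assumptions and are not part of NLTK. Unlike NLTK's Treebank tokenizer, which keeps a contraction suffix such as `'ve` together, this simplified version splits the apostrophe into its own token.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and any other non-space character is a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

toy_tokens = simple_tokenize("I've been to Los Angeles before.")
print(toy_tokens)
```

The difference in how contractions are handled is one reason to prefer a tested tokenizer over an ad hoc pattern.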

In [None]:
# Initiate the NLTK word tokenizer
tokenizer = nltk.tokenize.TreebankWordTokenizer()

# Take a look at an example of tokenization
tokenizer.tokenize("I've been to Los Angeles before.")

['I', "'ve", 'been', 'to', 'Los', 'Angeles', 'before', '.']

## Stop Words

\
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words.

\
In NLP and text mining applications, stop words are typically removed from the text so that models can focus on the more informative words.
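
As a rough illustration of stop-word filtering, the sketch below uses a tiny hand-written stop-word set. The set `toy_stop_words` and the helper `remove_stop_words` are assumptions made for this example; NLTK's English list is considerably longer.

```python
# Tiny hypothetical stop-word set; NLTK's English list is much longer.
toy_stop_words = {"the", "is", "and", "a", "to"}

def remove_stop_words(tokens, stop_word_set):
    # Keep only the tokens whose lower-cased form is not a stop word.
    return [t for t in tokens if t.lower() not in stop_word_set]

toy_sentence = ["The", "climate", "is", "changing", "and", "fast"]
filtered = remove_stop_words(toy_sentence, toy_stop_words)
print(filtered)
```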

In [None]:
# Get the set of stop words from NLTK's English stop-word list
stop_words = set(stopwords.words('english'))

# Take a look at the stop words
print(stop_words)

# There are 179 stop words in total
print(len(stop_words))

{'you', 'him', 'after', 'the', 'just', 'hadn', 'in', 'y', 'who', 'has', 'of', 'didn', 'about', 'theirs', 'each', "weren't", 'won', 'with', 'above', 'needn', 'at', 'can', 'its', 'some', "aren't", 'couldn', 'into', 'my', "should've", 'this', 'be', "couldn't", "shouldn't", 'now', 'weren', 'such', 't', 'itself', 'off', 'own', 'wouldn', 'than', 'were', 'for', 'having', 'until', "she's", 'those', 'herself', 'her', 'their', 'below', 'once', 'ourselves', "that'll", 'over', 'very', 'aren', "hadn't", "you've", "needn't", 'through', 'as', 'but', 'down', 'yourselves', 'am', 'them', 'nor', 'his', 'all', 'to', 'o', 'or', 'isn', 'your', 'was', 'hers', "mightn't", 'hasn', 'any', 'did', 'is', 'which', 'haven', 'mightn', 've', 'then', "haven't", 'myself', "don't", 'out', "doesn't", 'while', 'both', 'themselves', 'between', "you're", "you'll", 'there', 'before', 'mustn', 'it', 'they', 'by', 'd', 's', 'our', 'what', 'further', 'had', 'shan', "shan't", 'up', "didn't", 'shouldn', 'an', 'during', 'only', "mu

In [None]:
# Split the data into 90% train and 10% test
X_train, X_test, y_train, y_test = train_test_split(
  df['text'], df['label'], test_size = 0.1)
X_train.shape, X_test.shape

((39548,), (4395,))

In [None]:
# Print a single example from the train set
pd.DataFrame(X_train).iloc[0]
pd.DataFrame(y_train).iloc[0]

text    RT @CNN: Asked about climate change, Tom Bossert says “there is a cyclical nature to a lot of these hurricane seasons” https://t.co/fz2DhTk…
Name: 26766, dtype: object

label    3
Name: 26766, dtype: int64

## Lemmatization vs. Stemming

\
* Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

\
* Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster.

\
Examples:
* The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

* The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization.

* The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.

\
References: 
* [Wikipedia page](https://en.wikipedia.org/wiki/Lemmatisation)
* [Read more about lemmatization vs. stemming](https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221)
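
The contrast in the examples above can be sketched with a toy rule-based stemmer and a toy dictionary-based lemmatizer. Everything here (`TOY_LEMMAS`, `toy_stem`, `toy_lemmatize`) is hypothetical and far simpler than the NLTK classes; it only shows why a dictionary look-up with part-of-speech information can succeed where suffix stripping cannot.

```python
# Toy lemma dictionary keyed by (word, part of speech). The entries are
# hypothetical and cover only the examples discussed above.
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("walking", "verb"): "walk",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_stem(word):
    # Rule-based: strip a trailing "ing", with no dictionary and no context.
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word

def toy_lemmatize(word, pos):
    # Dictionary look-up that depends on the part of speech.
    return TOY_LEMMAS.get((word, pos), word)

examples = [("better", "adj"), ("walking", "verb"),
            ("meeting", "noun"), ("meeting", "verb")]
for word, pos in examples:
    print(word, pos, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word, pos))
```

The stemmer returns the same result for "meeting" regardless of its role in the sentence, whereas the lemmatizer distinguishes the noun from the verb.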

In [None]:
# Initiate the NLTK word lemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()

# Take a look at an example of lemmatization
lemmatizer.lemmatize('hours')

'hour'

In [None]:
# As an alternative to lemmatization, we could use stemming instead
stemmer = nltk.stem.SnowballStemmer("english")

# Take a look at an example of stemming
stemmer.stem("having")

'have'

In [None]:
# Create a function to clean text data
def preprocessor(text):

  # Remove all html markup from a tweet
  text = re.sub('<[^>]*>', '', text)

  # Remove @username from a tweet
  text = re.sub(r"@[^\s]+",'', text)

  # Remove http links from a tweet
  text = re.sub(r'http[^\s]+', '', text)

  # Find all emoticons and store them temporarily
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

  # Remove all non-word characters, convert the text to lower case, and add back the stored emoticons
  text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
  return text

In [None]:
# Apply the above preprocessor to clean the tweet texts
X_train = X_train.apply(preprocessor)
X_test = X_test.apply(preprocessor)

# Take a look at the first five tweets
pd.DataFrame(X_train[:5])

Unnamed: 0,text
26766,rt asked about climate change tom bossert says there is a cyclical nature to a lot of these hurricane seasons
10948,rt bird species vanish from uk due to climate change and habitat loss
32807,when mexico pay for a positive impact on behalf of the world trade center right now we need global warming iâ ve said if
7819,rt kim jong il will always be remembered fondly for his leadership and contributions on climate change
28714,pacific island countries could lose 50 80 of fish in local waters under climate change


In [None]:
# Create a function to tokenize and lemmatize text
def tokenizer_lemmetizer(text):
  
  # Tokenize a tweet
  text = tokenizer.tokenize(text)

  # Remove stop words (the text was already lower-cased by the preprocessor)
  text = [token for token in text if token not in stop_words]

  # Lemmatize each token and combine the tokens into a single string
  return ' '.join([lemmatizer.lemmatize(word) for word in text])

In [None]:
# Apply the above function on the train and test sets 
X_train = X_train.apply(tokenizer_lemmetizer)
X_test = X_test.apply(tokenizer_lemmetizer)

# Take a look at the first five tweets
pd.DataFrame(X_train[:5])

Unnamed: 0,text
26766,rt asked climate change tom bossert say cyclical nature lot hurricane season
10948,rt bird specie vanish uk due climate change habitat loss
32807,mexico pay positive impact behalf world trade center right need global warming iâ said
7819,rt kim jong il always remembered fondly leadership contribution climate change
28714,pacific island country could lose 50 80 fish local water climate change


In [None]:
# Construct the vocabulary of the bag-of-words model
count = CountVectorizer(
  # Remove stop words    
  stop_words = 'english',
  # Create a 1-gram vocabulary (i.e., single words)
  # Note: use (1, 2) to include both 1-grams and 2-grams
  ngram_range = (1, 1), 
  # Build a vocabulary of 10000 most frequent words
  max_features = 10000)

In [None]:
# Fit and transform the train set into sparse feature vectors
X_train_bag = count.fit_transform(X_train)
print(X_train_bag.shape)

# Transform the test set into sparse feature vectors
X_test_bag = count.transform(X_test)
print(X_test_bag.shape)

(39548, 10000)
(4395, 10000)


In [None]:
# Show the size and contents of the vocabulary
print(len(count.vocabulary_))
print(count.vocabulary_)

10000
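
To show what a bag-of-words representation contains, the following sketch builds a vocabulary and count vectors with the standard library only. The helpers `toy_ngrams`, `build_vocab`, and `to_vector` and the two toy documents are assumptions made for illustration. The sketch omits details that `CountVectorizer` handles, such as its own tokenization pattern and the `max_features` cut-off.

```python
from collections import Counter

def toy_ngrams(tokens, n_min, n_max):
    # Every contiguous n-gram for n in [n_min, n_max], joined by spaces.
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocab(docs, n_min, n_max):
    # Sorted list of all distinct n-grams seen in the corpus.
    return sorted({g for d in docs for g in toy_ngrams(d.split(), n_min, n_max)})

def to_vector(doc, vocab, n_min, n_max):
    # Count how often each vocabulary term occurs in one document.
    term_counts = Counter(toy_ngrams(doc.split(), n_min, n_max))
    return [term_counts[term] for term in vocab]

toy_docs = ["climate change is real", "climate news"]
vocab_uni = build_vocab(toy_docs, 1, 1)   # single words only
vocab_bi = build_vocab(toy_docs, 1, 2)    # single words and word pairs
print(vocab_uni)
print(vocab_bi)
print(to_vector(toy_docs[1], vocab_bi, 1, 2))
```

Adding 2-grams enlarges the vocabulary quickly, which is one reason to cap the number of features.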


## Term Frequency Inverse Document Frequency

\
**Term Frequency - Inverse Document Frequency (TF-IDF)** is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).

\
**Term Frequency**: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.

\
**Inverse Document Frequency**: IDF of a term is based on the inverse of the proportion of documents in the corpus that contain the term, usually taken on a log scale. Words that appear in only a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

\
The TF-IDF of a term is calculated by multiplying TF and IDF scores.

\
References:
* [TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558)
* [TF-IDF — Term Frequency-Inverse Document Frequency](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/)
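
The calculation can be sketched in plain Python. This is a minimal sketch, assuming the smoothed variant documented for scikit-learn's `TfidfTransformer` with `smooth_idf = True` and `norm = 'l2'`: raw counts are used as TF, IDF is `ln((1 + n) / (1 + df)) + 1`, and each document vector is scaled to unit length. The function `toy_tfidf` and the count matrix are hypothetical.

```python
import math

def toy_tfidf(count_matrix):
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: the number of documents that contain each term.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weights = []
    for row in count_matrix:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2-normalize each document vector (guard against all-zero rows).
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        weights.append([v / norm for v in weighted])
    return idf, weights

# Hypothetical counts: 2 documents (rows) by 3 terms (columns).
toy_counts = [[2, 1, 0],
              [1, 0, 1]]
toy_idf, toy_weights = toy_tfidf(toy_counts)
print(toy_idf)
print(toy_weights)
```

The first term occurs in every document, so it receives the lowest IDF; the other two terms occur in one document each and are weighted more heavily.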

In [None]:
# Take the raw term frequencies built by CountVectorizer as input and 
# transform them into the term frequency-inverse document frequency (tf-idf)
tfidf = TfidfTransformer(use_idf = True, norm = 'l2', smooth_idf = True)
X_train_tfidf = tfidf.fit_transform(X_train_bag)
X_test_tfidf = tfidf.transform(X_test_bag)

In [None]:
# Show the tfidf
print(X_train_tfidf.toarray())
print('\n')
X_train_tfidf.toarray().shape

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]




(39548, 10000)

In [None]:
# Fit a logistic regression model and report its accuracy on the test set
log_tfidf = LogisticRegression(solver = 'liblinear', random_state = 42)
score = log_tfidf.fit(X_train_tfidf, y_train).score(X_test_tfidf, y_test)
print(score)

0.7001137656427758


In [None]:
# Compare the model's accuracy with the baseline accuracy of a dummy classifier
dummy_classifier = DummyClassifier(strategy = 'stratified')
dummy_classifier.fit(X_train_tfidf, y_train).score(X_test_tfidf, y_test)

0.35813424345847555
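
For context on this baseline: the `stratified` strategy draws each prediction at random according to the class proportions in the training labels. If the test set has the same proportions, the expected accuracy is the sum of the squared proportions. The sketch below illustrates this with made-up proportions; `toy_proportions` is hypothetical and was not measured from the tweet dataset.

```python
def expected_stratified_accuracy(proportions):
    # A random guess and a true label both fall in class k with probability p_k * p_k.
    return sum(p * p for p in proportions)

def majority_class_accuracy(proportions):
    # Always predicting the most frequent class is correct with probability max(p).
    return max(proportions)

# Hypothetical class proportions, not taken from the tweet dataset.
toy_proportions = [0.5, 0.2, 0.2, 0.1]
print(expected_stratified_accuracy(toy_proportions))
print(majority_class_accuracy(toy_proportions))
```

On imbalanced data the majority-class baseline (`strategy = 'most_frequent'` in `DummyClassifier`) is usually the harder one to beat, so it can be worth reporting as well.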

In [None]:
# Initiate the synonym augmentation 
aug_syn = naw.SynonymAug(
  aug_src = 'wordnet',
  aug_max = 3)
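
To illustrate the idea behind synonym augmentation, here is a toy sketch that replaces up to a fixed number of words with entries from a hand-written synonym table. `TOY_SYNONYMS` and `toy_synonym_augment` are hypothetical; nlpaug's `SynonymAug` looks up synonyms in WordNet and has further options that this sketch ignores.

```python
import random

# Hypothetical synonym table; SynonymAug uses WordNet instead.
TOY_SYNONYMS = {
    "big": ["large", "huge"],
    "change": ["shift"],
    "fast": ["quick", "rapid"],
}

def toy_synonym_augment(text, aug_max, rng):
    words = text.split()
    # Positions of the words that have at least one synonym available.
    candidates = [i for i, w in enumerate(words) if w in TOY_SYNONYMS]
    # Replace at most aug_max randomly chosen candidate words.
    for i in rng.sample(candidates, min(aug_max, len(candidates))):
        words[i] = rng.choice(TOY_SYNONYMS[words[i]])
    return " ".join(words)

toy_rng = random.Random(0)
toy_original = "big change is coming fast"
toy_augmented = toy_synonym_augment(toy_original, aug_max=2, rng=toy_rng)
print(toy_augmented)
```

The augmented sentence keeps the label of the original, which is what allows it to be added to the training set as an extra example.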

In [None]:
# Initiate the contextual word embeddings (BERT, DistilBERT, RoBERTa, or XLNet) augmentation
aug_emb = naw.ContextualWordEmbsAug(
  # Other models include 'distilbert-base-uncased', 'roberta-base', etc.
  model_path = 'roberta-base', 
  # You can also choose "insert"
  action = "substitute",
  # Use GPU
  device = 'cuda'
  )


In [None]:
# Initiate the back translation augmentation
aug_bt = naw.BackTranslationAug(
  # Translate English to German
  from_model_name = 'facebook/wmt19-en-de', 
  # Translate German back to English
  to_model_name = 'facebook/wmt19-de-en',
  # Use GPU
  device = 'cuda')


In [None]:
# Create a function to evaluate the effect of text augmentation on test-set accuracy
def evaluate_aug(aug_strategy, n, X_train, y_train, X_test, y_test):

  # Select the augmenter once, before looping over the tweets
  if aug_strategy == 'synonym':
    augmenter = aug_syn
  elif aug_strategy == 'embedding':
    augmenter = aug_emb
  else:
    augmenter = aug_bt

  # Create two lists to store augmented tweets and their corresponding labels
  augmented_tweets = []
  augmented_tweets_labels = []

  # Loop through the train set to create augmented tweets
  # Note: We create n augmented tweets per original tweet.
  for i in X_train.index:
    lst_augment = augmenter.augment(X_train.loc[i], n = n)
    # Some nlpaug versions return a single string instead of a list when n = 1
    if isinstance(lst_augment, str):
      lst_augment = [lst_augment]
    for augment in lst_augment:
      augmented_tweets.append(augment)
      augmented_tweets_labels.append(y_train.loc[i])

  # Append the augmented tweets to the original tweets in the train set
  # Note: Series.append was removed in pandas 2.0, so pd.concat is used instead
  X_train_appended = pd.concat(
    [X_train, pd.Series(augmented_tweets, dtype = 'object')], ignore_index = True)
  y_train_appended = pd.concat(
    [y_train, pd.Series(augmented_tweets_labels, dtype = y_train.dtype)], ignore_index = True)

  # Apply the preprocessor to clean the tweet texts
  X_train_appended = X_train_appended.apply(preprocessor)

  # Apply tokenization and lemmatization
  X_train_appended = X_train_appended.apply(tokenizer_lemmetizer)

  # Fit and transform the appended train set into sparse feature vectors
  X_train_appended_bag = count.fit_transform(X_train_appended)

  # Transform the test set into sparse feature vectors
  X_test_bag = count.transform(X_test)

  # Take the raw term frequencies built by CountVectorizer as input and
  # transform them into the term frequency-inverse document frequency (tf-idf)
  X_train_appended_tfidf = tfidf.fit_transform(X_train_appended_bag)
  X_test_tfidf = tfidf.transform(X_test_bag)

  # Fit a logistic regression and return its accuracy on the test set
  return (log_tfidf.fit(X_train_appended_tfidf, y_train_appended).
          score(X_test_tfidf, y_test))

In [None]:
# Evaluate the synonym text augmentation
score_synonym = evaluate_aug(
  aug_strategy = 'synonym', 
  n = 1, 
  X_train = X_train, 
  y_train = y_train, 
  X_test = X_test, 
  y_test = y_test)
print(score_synonym)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


0.6994311717861206


In [None]:
# Evaluate the embedding text augmentation (less than 1 hour)
score_emb = evaluate_aug(
  aug_strategy = 'embedding', 
  n = 1,
  X_train = X_train, 
  y_train = y_train, 
  X_test = X_test, 
  y_test = y_test) 
print(score_emb)

0.7001137656427758


In [None]:
# Evaluate the back translation text augmentation (~10 hours)
score_bt = evaluate_aug(
  aug_strategy = 'backtranslation', 
  n = 1, 
  X_train = X_train, 
  y_train = y_train, 
  X_test = X_test, 
  y_test = y_test)
print(score_bt)

0.7046643913538112
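
To put the four accuracies side by side, the sketch below converts each printed score back into a number of correctly classified test tweets, using the test-set size of 4395 reported by the split. The dictionary `toy_results` simply restates the outputs above; the names are chosen for this example.

```python
# Accuracies printed by the cells above, and the test-set size from the split.
n_test = 4395
toy_results = {
    "no augmentation": 0.7001137656427758,
    "synonym": 0.6994311717861206,
    "contextual embedding": 0.7001137656427758,
    "back translation": 0.7046643913538112,
}

baseline_correct = round(toy_results["no augmentation"] * n_test)
differences = {}
for name, accuracy in toy_results.items():
    correct = round(accuracy * n_test)
    differences[name] = correct - baseline_correct
    print(f"{name}: {correct} correct ({differences[name]:+d} vs. no augmentation)")
```

The differences amount to a handful of tweets out of 4395. Because the train/test split above is not seeded, they would likely vary from run to run, so they are better read as indicative than as conclusive.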
