#Data Augmentation


Data augmentation methods are essential for expanding datasets by generating synthetic data from existing ones.

While image augmentation is a well-established practice in computer vision, text augmentation is a relatively novel concept in the field of natural language processing (NLP).

The nlpaug module offers a range of text augmentation algorithms that can improve the performance of NLP models.

In the previous part, we explored some exciting text augmentation functions available in the nlpaug module.

Now, in Part 2, we will use the nlpaug module to produce text augmentations for tweets. We will then compare the performance of bag-of-words models trained with and without text augmentation.

In [1]:
!pip install --upgrade gensim --quiet

In [2]:
import gensim
print(gensim.__version__)

4.3.2


In [3]:
!pip install transformers --quiet

In [4]:
import transformers

In [5]:
# Install the tokenizer needed by the back translation model
!pip install sacremoses --quiet


In [6]:
# Import the tokenizer
import sacremoses

In [7]:
# Install the nlpaug module
!pip install nlpaug --quiet


In [8]:

# Import the nlpaug module and its methods
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action

In [9]:

# Import other modules
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [10]:

# Show all outputs of a cell in a Jupyter notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Twitter Climate Change Sentiment Dataset

The dataset can be downloaded from Kaggle: https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset


The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo.


This dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43943 tweets were annotated. Each tweet is labelled independently by 3 reviewers. This dataset only contains tweets that all 3 reviewers agreed on (the rest were discarded).


Each tweet is labelled as one of the following classes:

2 (News): the tweet links to factual news about climate change
1 (Pro): the tweet supports the belief of man-made climate change
0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
-1 (Anti): the tweet does not believe in man-made climate change

In [13]:

df = pd.read_csv("/content/twitter_sentiment_data.csv", encoding='utf-8')


# Rename column names and remove the tweetid column
df = df.rename(columns = {"sentiment": "label", "message": "text"}).drop('tweetid', axis = 1)

In [14]:
# Recode the labels to 0, 1, 2, and 3
# 0 - anti; 1 - neutral; 2 - pro; 3 - news
# Note: run this cell only once; re-running it on already recoded labels shifts them again
df['label'] = df['label'].replace([-1, 0, 1, 2],[0, 1, 2, 3])
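
The recoding can also be expressed as an explicit dictionary lookup. The sketch below is a minimal pure-Python illustration, not the notebook's method: `LABEL_MAP`, `LABEL_NAMES`, and `recode` are hypothetical names, and the raw labels are made up. Because it builds a new list from the raw labels instead of overwriting them, calling it again on the same raw input gives the same result.

```python
# Hypothetical mapping from the dataset's raw labels to 0..3
LABEL_MAP = {-1: 0, 0: 1, 1: 2, 2: 3}
LABEL_NAMES = {0: "Anti", 1: "Neutral", 2: "Pro", 3: "News"}

def recode(raw_labels):
    """Return a new list of recoded labels; a KeyError signals an unexpected value."""
    return [LABEL_MAP[label] for label in raw_labels]

raw = [-1, 1, 1, 2, 0]  # made-up raw labels for illustration
recoded = recode(raw)
print(recoded)
print([LABEL_NAMES[label] for label in recoded])
```

An unexpected raw value fails loudly here, which makes accidental double recoding easier to spot.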

In [16]:
# Important note: check the DataFrame for missing values, which can
# cause errors during training. Here, we simply drop any rows with missing values.
df = df.dropna()

In [17]:
# Take a look at the first five samples in the dataframe
pd.set_option('display.max_colwidth', None)
df.head()

Unnamed: 0,label,text
0,1,@tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
1,3,"RT @NatGeoChannel: Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change https://t.co/LkDehj3tNn httÃ¢â‚¬Â¦"
2,3,Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. https://t.co/7rV6BrmxjW via @youtube
3,3,RT @Mick_Fanning: Just watched this amazing documentary by leonardodicaprio on climate change. We all think thisÃ¢â‚¬Â¦ https://t.co/kNSTE8K8im
4,3,"RT @cnalive: Pranita Biswasi, a Lutheran from Odisha, gives testimony on effects of climate change &amp; natural disasters on the poÃ¢â‚¬Â¦"


#Tokenization

Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.


The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).
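
To make the idea concrete, here is a minimal regex-based sketch of tokenization. It is a simplified stand-in and not the Treebank tokenizer used in this notebook; `simple_tokenize` is a hypothetical helper, and it splits contractions differently from NLTK.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Climate change is real, isn't it?")
print(tokens)
```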

In [18]:
# Initiate the NLTK word tokenizer
tokenizer = nltk.tokenize.TreebankWordTokenizer()

# Take a look at an example of tokenization
tokenizer.tokenize("I've been to Los Angeles before.")


['I', "'ve", 'been', 'to', 'Los', 'Angeles', 'before', '.']

#Stop Words

Stop words are a set of commonly used words in a language. For example, in English, “the”, “is”, and “and” would easily qualify as stop words.


In NLP and text mining applications, stop words are removed to eliminate unimportant words, allowing applications to focus on the important words instead.
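
As a small illustration of stop-word removal, the sketch below filters a token list against a tiny hand-written set. `SMALL_STOP_WORDS` and `remove_stop_words` are hypothetical names for this example only; the NLTK list is much longer.

```python
# A tiny, hypothetical stop-word set (the NLTK list is much longer)
SMALL_STOP_WORDS = {"the", "is", "and", "a", "of", "to"}

def remove_stop_words(tokens, stop_words=SMALL_STOP_WORDS):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "planet", "is", "warming", "and", "the", "ice", "is", "melting"]
kept = remove_stop_words(tokens)
print(kept)
```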

In [19]:

# Get a set of stop words from NLTK English dictionary
stop_words = set(stopwords.words('english'))

# Take a look at the stop words
print(stop_words)

# Count the stop words (179 in the NLTK release used here; the count can differ between releases)
print(len(stop_words))

{'no', "aren't", 'him', 'theirs', 'in', "didn't", 'should', 'be', 'both', 'weren', 'her', 'of', 'am', 'you', 'until', 'if', 'itself', 'has', 'some', 'out', 'd', 'hers', 'who', 'here', 'any', 'its', 'an', 'once', 'between', 'again', "hasn't", "won't", 'nor', "hadn't", 'all', "wouldn't", 'their', 'why', 'll', 'on', 'ain', 'ourselves', 'hadn', 'ours', 'when', 'then', 'the', 'very', 'his', 'about', 'under', "couldn't", "doesn't", 'he', 'through', 'my', 'our', 'now', 've', 'too', 'for', 'that', 'doing', 'with', 'there', 'hasn', 'aren', "you've", 'but', 'haven', 'a', 'just', 's', 'can', 'down', "you'll", 'from', 'few', 'she', 'before', 'most', 'which', 'this', 'off', 'not', 'them', 'yourselves', 'are', 'against', 'further', "mightn't", 'yours', 'below', 'as', 'is', 'up', 'being', "should've", "that'll", "mustn't", 'was', 'have', 'your', 'more', 'myself', 'during', 'isn', 'by', 'herself', 'couldn', 'were', "haven't", 'or', 'm', 'wouldn', 'it', "you'd", "don't", 'yourself', 'those', 'been', 'y

In [20]:

# Split the data into 90% train and 10% test
X_train, X_test, y_train, y_test = train_test_split(
  df['text'], df['label'], test_size = 0.1)
X_train.shape, X_test.shape

((39548,), (4395,))
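
The split above does not fix a random seed, so the selected rows change from run to run (scikit-learn's `train_test_split` accepts a `random_state` argument for this purpose). The sketch below shows the underlying idea with the standard library only. `split_indices` is a hypothetical helper, and rounding the test share up is an assumption that happens to reproduce the shapes shown above.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle sample indices with a fixed seed and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: round the test share up
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(43943, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))
```

With the same seed, the function returns the same partition every time, which makes later accuracy comparisons easier to interpret.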

In [21]:

# Print a single example from the train set
pd.DataFrame(X_train).iloc[0]
pd.DataFrame(y_train).iloc[0]

text    RT @nowthisnews: The Trump administration thinks protecting our planet from climate change is a waste of money https://t.co/QTGMi3Iv6U
Name: 16319, dtype: object

label    3
Name: 16319, dtype: int64

#Lemmatization vs. Stemming

Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster.

Examples:

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization.

The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
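
The contrast in these examples can be sketched with a toy suffix stripper and a toy dictionary lookup. Both functions and the `TOY_LEMMAS` table are hypothetical and far simpler than the NLTK stemmer and lemmatizer used in this notebook; they only illustrate why a dictionary is needed to link "better" to "good".

```python
# Hypothetical, hand-written lemma dictionary; a real lemmatizer uses a full lexicon such as WordNet
TOY_LEMMAS = {"better": "good", "walking": "walk"}

def toy_stem(word):
    """Strip a few common suffixes without consulting a dictionary."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ("better", "walking", "meeting"):
    print(word, "->", toy_stem(word), "(stem),", toy_lemmatize(word), "(lemma)")
```

Note that the toy stemmer always reduces "meeting" to "meet", regardless of whether the word is used as a noun or a verb.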

In [22]:
# Initiate the NLTK word lemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()

# Take a look at an example of lemmatization
lemmatizer.lemmatize('hours')

'hour'

In [23]:
# As an alternative to lemmatization, we could use stemming instead
stemmer = nltk.stem.SnowballStemmer("english")

# Take a look at an example of stemming
stemmer.stem("having")

'have'

In [24]:
# Create a function to clean text data
def preprocessor(text):

  # Remove all HTML markup from a tweet
  text = re.sub(r'<[^>]*>', '', text)

  # Remove @username mentions from a tweet
  text = re.sub(r'@[^\s]+', '', text)

  # Remove http(s) links from a tweet
  text = re.sub(r'http[^\s]+', '', text)

  # Find all emoticons and store them temporarily
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

  # Replace runs of non-word characters with a space, lower-case the text,
  # and append the stored emoticons (without the '-' nose), separated by a space
  text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
  return text

In [25]:
# Apply the above preprocessor to clean the tweet texts
X_train = X_train.apply(preprocessor)
X_test = X_test.apply(preprocessor)

# Take a look at the first five tweets
pd.DataFrame(X_train[:5])

Unnamed: 0,text
16319,rt the trump administration thinks protecting our planet from climate change is a waste of money
4522,evidence of global warming overwhelming ã â â œ kerry ã â â œ radio newã zealand
17936,the most damaging part of trump s climate change order is the message it sends via
13195,rt trump really needs to watch this film about climate change and national security
7739,arctic ice melt could trigger uncontrollable climate change at global level


In [26]:
# Create a function to tokenize and lemmatize text
def tokenizer_lemmetizer(text):

  # Tokenize a tweet
  text = tokenizer.tokenize(text)

  # Remove stop words (the text was already lower-cased by the preprocessor)
  text = [token for token in text if token not in stop_words]

  # Lemmatize each token and combine the tokens into a single string
  return ' '.join([lemmatizer.lemmatize(word) for word in text])

In [27]:
# Apply the above function on the train and test sets
X_train = X_train.apply(tokenizer_lemmetizer)
X_test = X_test.apply(tokenizer_lemmetizer)

# Take a look at the first five tweets
pd.DataFrame(X_train[:5])

Unnamed: 0,text
16319,rt trump administration think protecting planet climate change waste money
4522,evidence global warming overwhelming ã â â œ kerry ã â â œ radio newã zealand
17936,damaging part trump climate change order message sends via
13195,rt trump really need watch film climate change national security
7739,arctic ice melt could trigger uncontrollable climate change global level


In [28]:
# Construct the vocabulary of the bag-of-words model
count = CountVectorizer(
  # Remove stop words
  stop_words = 'english',
  # Create 1-gram vocabulary (i.e., single words)
  # Note: use (1, 2) to include both 1-grams and 2-grams, or (2, 2) for 2-grams only
  ngram_range = (1, 1),
  # Build a vocabulary of 10000 most frequent words
  max_features = 10000)

In [29]:
# Fit and transform the train set into sparse feature vectors
X_train_bag = count.fit_transform(X_train)
print(X_train_bag.shape)

# Transform the test set into sparse feature vectors
X_test_bag = count.transform(X_test)
print(X_test_bag.shape)

(39548, 10000)
(4395, 10000)
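
To show what a bag-of-words representation contains, here is a minimal pure-Python sketch. It is a simplified stand-in for `CountVectorizer`, not its implementation: `build_vocabulary` and `vectorize` are hypothetical helpers, tokens are split on whitespace only, and the vectors are dense lists instead of sparse matrices.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic
    most_common = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    return {token: idx for idx, token in enumerate(sorted(t for t, _ in most_common))}

def vectorize(doc, vocabulary):
    """Return a dense count vector for one document; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in doc.split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

docs = ["climate change climate policy", "ice melt climate", "policy change"]
vocab = build_vocabulary(docs, max_features=3)
print(vocab)
print([vectorize(d, vocab) for d in docs])
```

As in the cells above, the vocabulary is built from the training documents only, and tokens outside the vocabulary are dropped when a document is vectorized.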


In [30]:
# Show the library of vocabulary
print(len(count.vocabulary_))
print(count.vocabulary_)

10000


#Term Frequency Inverse Document Frequency

Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).


Term Frequency: the TF of a term is the number of times the term appears in a document, often divided by the total number of words in the document.


Inverse Document Frequency: the IDF of a term is based on the inverse of the proportion of documents in the corpus that contain the term, usually on a log scale. Words that occur in only a small percentage of documents (e.g., technical jargon) receive higher importance values than words common across all documents (e.g., a, the, and).


The TF-IDF of a term is calculated by multiplying TF and IDF scores.
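
The calculation can be sketched by hand. The example below assumes raw counts as TF and the smoothed IDF that scikit-learn documents for `smooth_idf=True`, namely ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. `tfidf_vector` is a hypothetical helper and the corpus statistics are made up.

```python
import math

def tfidf_vector(doc_counts, doc_freq, n_docs):
    """Compute an L2-normalized tf-idf vector from raw term counts."""
    weights = {}
    for term, tf in doc_counts.items():
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Made-up corpus statistics: 3 documents, "climate" in all of them, "arctic" in one
doc_freq = {"climate": 3, "arctic": 1}
vec = tfidf_vector({"climate": 1, "arctic": 1}, doc_freq, n_docs=3)
print(vec)
```

Both terms appear once in the document, but "arctic" receives the larger weight because it occurs in fewer documents.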

In [31]:

# Take the raw term frequencies built by CountVectorizer as input and
# transform them into the term frequency-inverse document frequency (tf-idf)
tfidf = TfidfTransformer(use_idf = True, norm = 'l2', smooth_idf = True)
X_train_tfidf = tfidf.fit_transform(X_train_bag)
X_test_tfidf = tfidf.transform(X_test_bag)

In [32]:
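# Note: toarray() builds a dense 39548 x 10000 float64 matrix (roughly 3 GB);
# inspect X_train_tfidf.shape directly if memory is limited
print(X_train_tfidf.toarray())
print('\n')
X_train_tfidf.toarray().shape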

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]




(39548, 10000)

In [33]:
# Build a logistic model
log_tfidf = LogisticRegression(solver = 'liblinear', random_state = 42)
score = log_tfidf.fit(X_train_tfidf, y_train).score(X_test_tfidf, y_test)
print(score)

0.7870307167235495


In [34]:
# Compare the model's accuracy to the baseline of a dummy classifier
dummy_classifier = DummyClassifier(strategy = 'stratified')
dummy_classifier.fit(X_train_tfidf, y_train).score(X_test_tfidf, y_test)

0.5772468714448237
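
To help interpret a baseline score like this one, the sketch below computes the accuracy we would expect from two simple strategies. It assumes the train and test sets share the same class proportions and that the random predictions are independent of the true labels. The function names are hypothetical and the proportions are made up, not taken from the dataset.

```python
def expected_stratified_accuracy(class_proportions):
    """Expected accuracy when predictions are drawn from the class distribution."""
    return sum(p * p for p in class_proportions)

def majority_accuracy(class_proportions):
    """Accuracy of always predicting the most frequent class."""
    return max(class_proportions)

proportions = [0.1, 0.2, 0.5, 0.2]  # hypothetical class shares, not the dataset's
print(expected_stratified_accuracy(proportions))
print(majority_accuracy(proportions))
```

The more imbalanced the classes, the higher both baselines become, so a model's accuracy should be read relative to them.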

In [35]:
# Initiate the synonym augmentation
aug_syn = naw.SynonymAug(
  aug_src = 'wordnet',
  aug_max = 3)

In [36]:
import torch
print(torch.cuda.is_available())


True


In [41]:
# Initiate the contextual word embeddings augmentation (BERT, DistilBERT, RoBERTa, or XLNet)
aug_emb = naw.ContextualWordEmbsAug(
  # Other models include 'distilbert-base-uncased', 'roberta-base', etc.
  model_path = 'roberta-base',
  # You can also choose "insert"
  action = "substitute",
  device = 'cuda'
  )

In [42]:
# Initiate the back translation augmentation
aug_bt = naw.BackTranslationAug(
  # Translate English to German
  from_model_name = 'facebook/wmt19-en-de',
  # Translate German back to English
  to_model_name = 'facebook/wmt19-de-en',
  # Use GPU
  device = 'cuda'
)

Some weights of FSMTForConditionalGeneration were not initialized from the model checkpoint at facebook/wmt19-en-de and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of FSMTForConditionalGeneration were not initialized from the model checkpoint at facebook/wmt19-de-en and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [43]:
import pandas as pd

# Create a function that augments the train set, retrains the model, and returns the test-set accuracy
# Note: it refits the global count, tfidf, and log_tfidf objects
def evaluate_aug(aug_strategy, n, X_train, y_train, X_test, y_test):

  # Create two lists to store augmented tweets and their corresponding labels
  augmented_tweets = []
  augmented_tweets_labels = []

  # Loop through the train set to create augmented tweets
  # Note: We create n augmented tweets per original tweet.
  for i in X_train.index:
    if aug_strategy == 'synonym':
      lst_augment = aug_syn.augment(X_train[i], n = n)
    elif aug_strategy == 'embedding':
      lst_augment = aug_emb.augment(X_train[i], n = n)
    else:
      lst_augment = aug_bt.augment(X_train[i], n = n)
    for augment in lst_augment:
      augmented_tweets.append(augment)
      augmented_tweets_labels.append(y_train[i])

  # Convert augmented tweets and labels to Series
  augmented_tweets_series = pd.Series(augmented_tweets, name=X_train.name)
  augmented_labels_series = pd.Series(augmented_tweets_labels, name=y_train.name)

  # Concatenate the augmented data with the original data
  X_train_appended = pd.concat([X_train, augmented_tweets_series], ignore_index=True)
  y_train_appended = pd.concat([y_train, augmented_labels_series], ignore_index=True)

  # Apply the preprocessor to clean the tweet texts
  X_train_appended = X_train_appended.apply(preprocessor)

  # Apply tokenization and lemmatization
  X_train_appended = X_train_appended.apply(tokenizer_lemmetizer)

  # Fit and transform the appended train set into sparse feature vectors
  X_train_appended_bag = count.fit_transform(X_train_appended)

  # Transform the test set into sparse feature vectors
  X_test_bag = count.transform(X_test)

  # Take the raw term frequencies built by CountVectorizer as input and
  # transform them into the term frequency-inverse document frequency (tf-idf)
  X_train_appended_tfidf = tfidf.fit_transform(X_train_appended_bag)
  X_test_tfidf = tfidf.transform(X_test_bag)

  # Remove all elements from the lists
  augmented_tweets.clear()
  augmented_tweets_labels.clear()

  # Fit a logistic regression
  return (log_tfidf.fit(X_train_appended_tfidf, y_train_appended).
          score(X_test_tfidf, y_test))
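
The core pattern of the function above, appending augmented copies to the train set while reusing the original labels, can be sketched without nlpaug. Everything in this example is hypothetical: `TOY_SYNONYMS` is a hand-written table (nlpaug's `SynonymAug` draws candidates from WordNet instead), and `toy_synonym_augment` and `append_augmented` are illustrative helpers.

```python
import random

# Hypothetical synonym table for illustration only
TOY_SYNONYMS = {"warming": ["heating"], "big": ["large", "huge"], "melt": ["thaw"]}

def toy_synonym_augment(text, rng, aug_max=3):
    """Replace up to aug_max words that have an entry in TOY_SYNONYMS."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in TOY_SYNONYMS]
    for i in rng.sample(candidates, min(aug_max, len(candidates))):
        words[i] = rng.choice(TOY_SYNONYMS[words[i]])
    return " ".join(words)

def append_augmented(texts, labels, n, seed=0):
    """Return the original samples plus n augmented copies of each, with matching labels."""
    rng = random.Random(seed)
    new_texts, new_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        for _ in range(n):
            new_texts.append(toy_synonym_augment(text, rng))
            new_labels.append(label)  # an augmented tweet inherits its source label
    return new_texts, new_labels

texts = ["global warming is a big problem", "arctic ice melt"]
labels = [2, 3]
aug_texts, aug_labels = append_augmented(texts, labels, n=1)
print(aug_texts)
print(aug_labels)
```

Only the train set is augmented; the test set is left untouched so that the reported accuracy still reflects real tweets.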


In [44]:
# Evaluate the synonym text augmentation
score_synonym = evaluate_aug(
  aug_strategy = 'synonym',
  n = 1,
  X_train = X_train,
  y_train = y_train,
  X_test = X_test,
  y_test = y_test)
print(score_synonym)

0.7936291240045507


In [45]:
# Evaluate the embedding text augmentation (less than 1 hour)
score_emb = evaluate_aug(
  aug_strategy = 'embedding',
  n = 1,
  X_train = X_train,
  y_train = y_train,
  X_test = X_test,
  y_test = y_test)
print(score_emb)

0.7920364050056883


In [None]:
# Evaluate the back translation text augmentation (~10 hours)
score_bt = evaluate_aug(
  aug_strategy = 'backtranslation',
  n = 1,
  X_train = X_train,
  y_train = y_train,
  X_test = X_test,
  y_test = y_test)
print(score_bt)