<a href="https://colab.research.google.com/github/packtpublishing/Machine-Learning-for-Imbalanced-Data/blob/main/chapter07/Data_level_techniques_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

References: https://amitness.com/2020/05/data-augmentation-for-nlp/

Dataset taken from: https://www.kaggle.com/datasets/team-ai/spam-text-message-classification?datasetId=2050&searchQuery=sampling

Dataset License: CC0: Public Domain

Some of the code is adapted from https://www.kaggle.com/code/jth359/imbalanced-target-variable-with-text-data

In [1]:
import pandas as pd
import numpy as np

url = "https://drive.google.com/file/d/1HQ4mqidhJKLhNEOd6agMDv_nl2caSlNB/view?usp=share_link"
url = "https://drive.google.com/uc?id=" + url.split("/")[-2]
df = pd.read_csv(url)

df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
df["Category"].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

About 87% of the messages are of class `ham` and about 13% are of class `spam`.
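The percentages follow directly from the counts printed above. A minimal sketch of the arithmetic, using only the two counts from `value_counts()` (the variable names are illustrative, not from the notebook):

```python
from collections import Counter

# Class counts as reported by value_counts() above.
counts = Counter({"ham": 4825, "spam": 747})
total = sum(counts.values())

# Share of each class, and the majority-to-minority imbalance ratio.
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts["ham"] / counts["spam"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
```

So there are roughly six to seven ham messages for every spam message.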

This notebook compares different techniques for handling imbalanced classes. To keep the comparison fair, every technique uses the same TF-IDF features and the same Logistic Regression classifier.
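To make the TF-IDF step less of a black box, here is a rough pure-Python sketch of the weighting. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which to my understanding are the scikit-learn defaults; the whitespace tokenizer and the toy documents are simplifications of my own, not what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "free entry win cash",
    "ok see you later",
    "win free prize now",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalised {term: weight} mapping for one document."""
    term_counts = Counter(tokens)
    # Raw term count times smoothed IDF.
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents ("entry", "cash") end up with a larger weight than terms shared across documents ("free", "win").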

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# convert all text to lowercase
df["Message"] = df["Message"].str.lower()

# perform train test split
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["Category"], stratify=df["Category"], random_state=11
)

# vectorize text using TFIDF
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [4]:
from collections import Counter

print(len(X_train))
print(Counter(y_train))

4179
Counter({'ham': 3619, 'spam': 560})


To begin, we train a logistic regression model on the data as-is, without doing anything about the class imbalance.





### Baseline Classifier

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression(random_state=42)
lr.fit(X_train_tfidf, y_train)
print(classification_report(y_test, lr.predict(X_test_tfidf)))

              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1206
        spam       0.99      0.78      0.87       187

    accuracy                           0.97      1393
   macro avg       0.98      0.89      0.93      1393
weighted avg       0.97      0.97      0.97      1393



I see that I get a relatively low recall of 0.78 on the minority class `spam`.
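To put the baseline report in concrete terms, the sketch below recomputes the spam F1-score from the reported precision (0.99) and recall (0.78), and estimates how many of the 187 spam messages in the test set were missed. The report rounds to two decimals, so the missed count is approximate; the function name is my own.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Spam row of the baseline classification report above.
precision, recall, support = 0.99, 0.78, 187

f1 = f1_from(precision, recall)
missed = round((1 - recall) * support)

print(f"F1 = {f1:.2f}")
print(f"roughly {missed} of {support} spam messages were missed")
```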

### Random Over Sampling

Next, I am going to try random over sampling, which duplicates randomly chosen minority-class rows until the classes are balanced.
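As a rough illustration of the idea behind random over sampling, the sketch below balances a toy dataset by drawing minority rows with replacement. It is a simplified stand-in written for this note, not imbalanced-learn's implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(rows) for rows in by_class.values())
    out_samples, out_labels = [], []
    for label, rows in by_class.items():
        # Draw with replacement to fill the gap to the majority size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_samples.extend(rows + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


samples = [f"msg{i}" for i in range(10)]
labels = ["ham"] * 8 + ["spam"] * 2

X_bal, y_bal = random_oversample(samples, labels)
print(Counter(y_bal))
```

No new information is created: the extra spam rows are exact copies of existing ones, which is why over sampling can encourage overfitting to those rows.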

In [8]:
df["Category"].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

In [9]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X_train_tfidf, y_train)

# check distribution after applying over sampling
y_ros.value_counts()

ham     3619
spam    3619
Name: Category, dtype: int64

Applying the same model with the over sampled data

In [10]:
lr = LogisticRegression(random_state=42)
lr.fit(X_ros, y_ros)
print(classification_report(y_test, lr.predict(X_test_tfidf)))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99      1206
        spam       0.93      0.95      0.94       187

    accuracy                           0.98      1393
   macro avg       0.96      0.97      0.96      1393
weighted avg       0.98      0.98      0.98      1393



The F1-score for the minority class went up from 0.87 to 0.94, and recall on `spam` improved from 0.78 to 0.95.

### Random Undersampling

Next, I am going to try random undersampling, which drops randomly chosen majority-class rows until the classes are balanced.
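The mirror image of over sampling can be sketched the same way: keep a random subset of each class, sized to the smallest class. Again this is a simplified stand-in for illustration, not imbalanced-learn's implementation; `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = min(len(rows) for rows in by_class.values())
    out_samples, out_labels = [], []
    for label, rows in by_class.items():
        # Sample without replacement, so no row is repeated.
        out_samples.extend(rng.sample(rows, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


samples = [f"msg{i}" for i in range(10)]
labels = ["ham"] * 8 + ["spam"] * 2

X_small, y_small = random_undersample(samples, labels)
print(Counter(y_small))
```

The cost is discarded data: in this notebook the training set shrinks from 3619 ham rows to 560, so the model sees far fewer examples of legitimate messages.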

In [11]:
df["Category"].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

In [12]:
from imblearn.under_sampling import RandomUnderSampler

ros = RandomUnderSampler()
X_ros, y_ros = ros.fit_resample(X_train_tfidf, y_train)

# check distribution after applying under sampling
y_ros.value_counts()

ham     560
spam    560
Name: Category, dtype: int64

Applying the same model with the under sampled data

In [13]:
lr = LogisticRegression(random_state=42)
lr.fit(X_ros, y_ros)
print(classification_report(y_test, lr.predict(X_test_tfidf)))

              precision    recall  f1-score   support

         ham       0.99      0.98      0.98      1206
        spam       0.85      0.94      0.90       187

    accuracy                           0.97      1393
   macro avg       0.92      0.96      0.94      1393
weighted avg       0.97      0.97      0.97      1393



F1-Score for the minority class is 0.90

### Data Augmentation

Now we will try translating the spam messages to another language and then translating them back to English. The idea is that the round trip adds a little noise, such as changed wording or punctuation, while keeping the meaning of the message.

Let's first see an example of this for a single message. I am going to take a message, translate it to German, and then translate it back to English.
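Before bringing in real translation models, the round-trip structure can be sketched with stand-in translators. Everything here is hypothetical: the two lookup tables are tiny word-level substitutes for a real model, and `back_translate` is a helper written for this note. The sketch also drops round trips that return the original text unchanged, since those add nothing new to the training set.

```python
def back_translate(text, forward, backward):
    """Translate text to a pivot language and back again."""
    return backward(forward(text))


# Hypothetical stand-ins for real translation models: word-level lookup tables.
EN_TO_DE = {"free": "kostenlos", "prize": "preis", "call": "anrufen", "now": "jetzt"}
DE_TO_EN = {"kostenlos": "free", "preis": "award", "anrufen": "ring", "jetzt": "now"}


def toy_forward(text):
    return " ".join(EN_TO_DE.get(word, word) for word in text.split())


def toy_backward(text):
    return " ".join(DE_TO_EN.get(word, word) for word in text.split())


originals = ["free prize call now", "ok see you"]
augmented = []
for msg in originals:
    new_msg = back_translate(msg, toy_forward, toy_backward)
    if new_msg != msg:  # skip round trips that changed nothing
        augmented.append(new_msg)

print(augmented)
```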

# Text augmentation using Back Translation with the nlpaug library


In [None]:
!pip install numpy requests nlpaug transformers sacremoses


In [None]:
import nlpaug.augmenter.word as naw

text = "The quick brown fox jumped over the lazy dog"
back_translation_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de", to_model_name="facebook/wmt19-de-en"
)
back_translation_aug.augment(text)

['The speedy brown fox leapt over the lazy dog']

In [None]:
df

Unnamed: 0,Category,Message
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,this is the 2nd time we have tried 2 contact u...
5568,ham,will ü b going to esplanade fr home?
5569,ham,"pity, * was in mood for that. so...any other s..."
5570,ham,the guy did some bitching but i acted like i'd...


In [None]:
df_train = pd.concat([X_train, y_train], axis=1)
df_train.head(2)

Unnamed: 0,Message,Category
1426,i'll be at mu in like &lt;#&gt; seconds,ham
1247,"i do know what u mean, is the king of not hav...",ham


In [None]:
import nlpaug.augmenter.word as naw

# English -> German -> English round trip
back_translation_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de", to_model_name="facebook/wmt19-de-en"
)
translated_texts = []

# back-translate every spam message in the training set
for message in df_train[df_train["Category"] == "spam"]["Message"]:
    new_msg = back_translation_aug.augment(message)[0]
    # left: original message, right: back-translated message
    print(message, "  |   ", new_msg)
    translated_texts.append(new_msg)

for sale - arsenal dartboard. good condition but no doubles or trebles!   |    for sale - Arsenal dartboard. Good condition, but no doubles or triplets!
500 new mobiles from 2004, must go! txt: nokia to no: 89545 & collect yours today!from only £1 www.4-tc.biz 2optout 087187262701.50gbp/mtmsg18 txtauction   |    500 new phones from 2004, must go! txt: nokia to no: 89545 & collect yours today! from just £1 www.4-tc.biz 2optout 087187262701.50gbp / mtmsg18 txtauction
urgent! we are trying to contact u. todays draw shows that you have won a £2000 prize guaranteed. call 09058094507 from land line. claim 3030. valid 12hrs only   |    s draw shows you have won a guaranteed prize of £2000. call 09058094507 from country. claim 3030. valid only 12 hours
text banneduk to 89555 to see! cost 150p textoperator g696ga 18+ xxx   |    Text banneduk to 89555 to see! cost 150p textoperator g696ga 18 + xxx
as a sim subscriber, you are selected to receive a bonus! get it delivered to your door, txt the wo

KeyboardInterrupt: ignored

In [None]:
len(translated_texts)

301

In [None]:
translations_df = pd.DataFrame({"Message": translated_texts, "Category": "spam"})
df_train_translations = pd.concat([df_train, translations_df])

In [None]:
df_train_translations["Category"].value_counts()

ham     3609
spam     871
Name: Category, dtype: int64

In [None]:
df_train_translations["Message"] = df_train_translations["Message"].str.lower()

In [None]:
# perform TFIDF
X_train_trans_tfidf = tfidf.transform(df_train_translations["Message"])

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


lr = LogisticRegression()
lr.fit(X_train_trans_tfidf, df_train_translations["Category"])
print(classification_report(y_test, lr.predict(X_test_tfidf)))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1216
        spam       0.96      0.86      0.91       177

    accuracy                           0.98      1393
   macro avg       0.97      0.93      0.95      1393
weighted avg       0.98      0.98      0.98      1393



# Text augmentation using simple EDA: Easy Data Augmentation Techniques

Code from paper authors: https://github.com/jasonwei20/eda_nlp

The simple data augmentation techniques are:
- SR : Synonym Replacement
- RD : Random Deletion
- RS : Random Swap
- RI : Random Insertion

### EDA

In [6]:
%%capture
!pip install numpy requests
!pip install -U nltk
import nltk
nltk.download('wordnet')

In [7]:
df_train = pd.concat([X_train, y_train], axis=1)
df_train.head(2)

Unnamed: 0,Message,Category
1280,waiting 4 my tv show 2 start lor... u leh stil...,ham
1658,s:-)if we have one good partnership going we w...,ham


In [8]:
# Adapted from https://github.com/jasonwei20/eda_nlp

"""
Take an input sentence,
an alpha_sr parameter (controls the number of synonym replacement operations)
an alpha_ri parameter (controls the number of random insertion operations)
an alpha_rs parameter (controls the number of random swap operations)
an alpha_rd parameter (controls the number of random deletion operations)
"""

import random
from nltk.corpus import wordnet


def get_synonyms(word):
    """Get synonyms of a word from WordNet."""
    synonyms = wordnet.synsets(word)
    if synonyms:
        words = [lem.name() for lem in synonyms[0].lemmas()]
        return words
    else:
        return []


def replace_with_synonyms(message, alpha_sr=0.1):
    """
    Randomly choose n words from the message (stop words are not filtered out).
    Replace each of these words with one of its WordNet synonyms chosen at random,
    if it has any.
    """
    words = message.split(" ")
    num_words = len(words)
    n = max(1, int(alpha_sr * num_words))
    for _ in range(n):
        word_to_replace = random.choice(words)
        synonyms = get_synonyms(word_to_replace)
        if synonyms:
            words[words.index(word_to_replace)] = random.choice(synonyms)
    return " ".join(words)


def random_deletion(message, alpha_rd=0.1):
    """
    Delete n randomly chosen words from the message, always leaving at least one word.
    """
    words = message.split(" ")
    num_words = len(words)
    n = max(1, int(alpha_rd * num_words))
    for _ in range(n):
        if len(words) > 1:
            del words[random.randint(0, len(words) - 1)]
    return " ".join(words)


def random_insertion(message, alpha_ri=0.1):
    """
    Find a random synonym of a random word in the message (stop words are not filtered out).
    Insert that synonym into a random position in the message. Do this n times.
    """
    words = message.split(" ")
    num_words = len(words)
    n = max(1, int(alpha_ri * num_words))
    for _ in range(n):
        synonym_word = random.choice(words)
        synonyms = get_synonyms(synonym_word)
        if synonyms:
            words.insert(random.randint(0, len(words)), random.choice(synonyms))
    return " ".join(words)


def random_swap(message, alpha_rs=0.1):
    """
    Randomly choose two words in the sentence and swap their positions.
    Do this n times.
    """
    words = message.split(" ")
    num_words = len(words)
    n = max(1, int(alpha_rs * num_words))
    for _ in range(n):
        if len(words) > 1:
            idx1, idx2 = random.sample(range(len(words)), 2)
            words[idx1], words[idx2] = words[idx2], words[idx1]
    return " ".join(words)
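All four functions above derive the number of edits `n` from the message length in the same way: `max(1, int(alpha * num_words))`. A small sketch of that rule on its own (the helper name `num_operations` is mine, not part of the adapted code):

```python
def num_operations(message, alpha=0.1):
    """Number of edits applied to a message: alpha * word count, but at least one."""
    return max(1, int(alpha * len(message.split(" "))))


short_message = "win cash now"
long_message = " ".join(["word"] * 25)

print(num_operations(short_message))            # 3 words  -> floor is 1
print(num_operations(long_message))             # 25 words -> int(2.5) = 2
print(num_operations(long_message, alpha=0.3))  # 25 words -> int(7.5) = 7
```

Because of the `max(1, ...)` floor, even very short messages receive one edit per operation, so short messages are perturbed proportionally more than long ones.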

In [9]:
augmented_texts = []
for message in df_train[df_train["Category"] == "spam"]["Message"]:
    # chain the EDA operations; `message` is overwritten at each step
    message = replace_with_synonyms(message)
    message = random_deletion(message)
    message = random_insertion(message)
    augmented_text = random_swap(message)
    # left: text before the final random swap, right: fully augmented text
    print(message, "  |   ", augmented_text)
    augmented_texts.append(augmented_text)

winner!! as a valued network client you have been selected to receivea £900 to claim call 09061701461. claim code kl341. valid 12 be hours only.   |    selected as a valued network client you have been winner!! to receivea £900 12 claim call 09061701461. claim code kl341. valid to be hours only.
your account has been credited 500 free text messages. to activate, just txt the word: credit credit to no: 80488 www.80488.biz   |    account your has been credited 500 free text messages. to activate, just txt credit word: the credit to no: 80488 www.80488.biz
urgent! we are trying to contact u todays draw shows that have won a £800 prize guaranteed. call from land line. claim U j89. po box245c2150pm   |    todays we are trying to prize u urgent! draw shows that have won a £800 contact guaranteed. call from land line. claim U j89. po box245c2150pm
want to operator funk up ur fone with weekly new tone reply tones2u 2 this text. www.ringtones.co.uk, the original n best. 3gbp network operator ra

In [10]:
eda_df = pd.DataFrame({"Message": augmented_texts, "Category": "spam"})
df_train_eda = pd.concat([df_train, eda_df])

In [11]:
df_train_eda["Category"].value_counts()

ham     3619
spam    1120
Name: Category, dtype: int64

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df_train_eda["Message"] = df_train_eda["Message"].str.lower()

df_train_eda["Message"] = df_train_eda["Message"].fillna(" ")

# perform TFIDF
X_train_eda_tfidf = tfidf.transform(df_train_eda["Message"])
lr = LogisticRegression()
lr.fit(X_train_eda_tfidf, df_train_eda["Category"])
print(classification_report(y_test, lr.predict(X_test_tfidf)))

              precision    recall  f1-score   support

         ham       0.98      0.99      0.99      1206
        spam       0.96      0.90      0.93       187

    accuracy                           0.98      1393
   macro avg       0.97      0.95      0.96      1393
weighted avg       0.98      0.98      0.98      1393
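To compare the techniques side by side, the sketch below collects the `spam` rows of the classification reports shown in this notebook and ranks them by F1-score. The numbers are copied from the outputs above; note that the back-translation run reports a different test support (177 spam messages rather than 187) and used only 301 augmented messages, so it is not strictly comparable with the others.

```python
# (precision, recall, f1) for the spam class, copied from the reports above.
results = {
    "baseline": (0.99, 0.78, 0.87),
    "random over sampling": (0.93, 0.95, 0.94),
    "random under sampling": (0.85, 0.94, 0.90),
    "back translation": (0.96, 0.86, 0.91),
    "EDA": (0.96, 0.90, 0.93),
}

# Rank techniques by spam F1-score, best first.
ranked = sorted(results.items(), key=lambda item: item[1][2], reverse=True)

print(f"{'technique':<24}{'precision':>10}{'recall':>8}{'f1':>6}")
for name, (precision, recall, f1) in ranked:
    print(f"{name:<24}{precision:>10.2f}{recall:>8.2f}{f1:>6.2f}")
```

Every technique improves spam recall over the baseline, at the price of somewhat lower precision.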

