# Challenge - Text-Augmentation

![](https://images.unsplash.com/photo-1534770733765-337d273901c1?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1016&q=80)

Photo by [Franck V.](https://unsplash.com/photos/oIMXkEuiXpc)

In general, the more data we have, the better the performance we can achieve. With numerical data, generating extra samples is fairly easy (see the lessons on Customer Churn and Anomaly Detection), but with text it's a bit more complicated. We will see how to use word embeddings to do that.

First of all, let's go back to the spam classifier challenge of the 01-Processing-Text course. The aim is to improve your results on that exercise with text augmentation.

Remember, a spam classifier is a Machine Learning model that classifies texts (email or SMS) into two categories: spam (1) or ham (0).

To do that, we will reuse our knowledge: we will apply preprocessing and BOW or Tf-Idf on a dataset of texts.
Then we will use logistic regression to predict which class a new email/SMS belongs to, based on the BOW.
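
As a reminder of what a bag-of-words representation looks like, here is a minimal sketch in plain Python. The two toy messages and the helper name `bag_of_words` are assumptions made up for illustration; they are not part of the dataset or of scikit-learn.

```python
from collections import Counter

def bag_of_words(token_lists):
    """Return (vocabulary, count matrix) for a list of tokenized texts."""
    # Sorted vocabulary so that the column order is deterministic
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # One column per vocabulary word, holding its count in this text
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Two toy messages, already tokenized (hypothetical examples)
docs = [["free", "prize", "call", "free"], ["call", "home", "later"]]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['call', 'free', 'home', 'later', 'prize']
print(matrix)      # [[1, 2, 0, 0, 1], [1, 0, 1, 1, 0]]
```

Tf-Idf starts from the same counts and then down-weights words that appear in many messages.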

In [1]:
# TODO: Import NLTK and all the needed libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

stop = stopwords.words('english')

Load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as loading option.

In [2]:
# TODO : load data
data = pd.read_csv('../input/spam.csv', encoding='latin-1')

Explore the dataset and check the balance of labels.

In [3]:
data

| | Class | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ï¿½_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [4]:
# TODO : how many spams and how many hams ?
data.Class.value_counts()

ham     4825
spam     747
Name: Class, dtype: int64

Only 747 spams for 4825 hams: the dataset is quite **unbalanced**.
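
To put a number on this imbalance, the short sketch below recomputes the class shares from the counts shown above. The variable names are illustrative.

```python
# Label counts reported by value_counts() above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total
ham_per_spam = counts["ham"] / counts["spam"]

print(f"total messages: {total}")             # total messages: 5572
print(f"spam share: {spam_share:.1%}")        # spam share: 13.4%
print(f"hams per spam: {ham_per_spam:.1f}")   # hams per spam: 6.5
```

A classifier that always answers "ham" would therefore already be right about 87% of the time, which is why accuracy alone is not a useful metric here.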

Before dealing with this problem, perform a classification using logistic regression and a BOW or Tf-Idf and compute the F1-score on the minority class (spam) with a classification report. 

> ⚠️ Hint : lemmatize your texts and set a random state for your classifier. 

In [5]:
#Function to transform nltk pos_tags to WordNet pos_tags

def get_wordnet_pos(pos_tag):
    output = np.asarray(pos_tag)
    for i in range(len(pos_tag)):
        if pos_tag[i][1].startswith('J'):
            output[i][1] = wordnet.ADJ
        elif pos_tag[i][1].startswith('V'):
            output[i][1] = wordnet.VERB
        elif pos_tag[i][1].startswith('R'):
            output[i][1] = wordnet.ADV
        else:
            output[i][1] = wordnet.NOUN
    return output

In [6]:
# TODO : preprocessing
def preprocess(text):
    tokens = word_tokenize(text.lower())
    pos_tags = nltk.pos_tag(tokens)
    tags = get_wordnet_pos(pos_tags)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(tokens[i], tags[i,1]) for i in range(len(tokens)) if tokens[i].isalpha() and tokens[i] not in stop and len(tokens[i])>1]
    return tokens

In [7]:
# TODO : preprocessing
messages = data.Message.apply(preprocess)

In [8]:
messages

0       [go, jurong, point, crazy, available, bugis, g...
1                               [ok, lar, joke, wif, oni]
2       [free, entry, wkly, comp, win, fa, cup, final,...
3                    [dun, say, early, hor, already, say]
4             [nah, think, go, usf, live, around, though]
                              ...                        
5567    [time, try, contact, pound, prize, claim, easy...
5568                            [go, esplanade, fr, home]
5569                             [pity, mood, suggestion]
5570    [guy, bitching, act, like, interested, buy, so...
5571                                   [rofl, true, name]
Name: Message, Length: 5572, dtype: object

In [9]:
X = messages
y = data.Class

In [10]:
# TODO : split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y, random_state=42)

In [11]:
# TODO : BOW or TFIDF
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
tfidf_train = vectorizer.fit_transform(X_train).toarray()

In [12]:
tfidf_test = vectorizer.transform(X_test)

In [13]:
tfidf_train.shape

(4457, 5535)

In [14]:
# TODO : logistic regression
logr = LogisticRegression(random_state=42)
logr.fit(tfidf_train, y_train)

LogisticRegression(random_state=42)

In [15]:
y_pred = logr.predict(tfidf_test)

In [16]:
y_pred_train = logr.predict(tfidf_train)

In [17]:
# TODO : check the F1-score on the minority class
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       0.99      0.73      0.84       149

    accuracy                           0.96      1115
   macro avg       0.98      0.87      0.91      1115
weighted avg       0.96      0.96      0.96      1115
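
The F1-score in the report is the harmonic mean of precision and recall. The sketch below recomputes the spam F1 from the rounded precision and recall printed above; the helper name `f1_score_from` is an illustrative assumption, not a scikit-learn function.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Spam row of the test report above (values already rounded to 2 decimals)
spam_f1 = f1_score_from(precision=0.99, recall=0.73)
print(round(spam_f1, 2))  # 0.84
```

The high precision and low recall mean that almost every message flagged as spam really is spam, but roughly a quarter of the spams are missed.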



In [18]:
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      3859
        spam       0.99      0.78      0.87       598

    accuracy                           0.97      4457
   macro avg       0.98      0.89      0.93      4457
weighted avg       0.97      0.97      0.97      4457



In [39]:
logr_bal = LogisticRegression(class_weight='balanced', random_state=42)
logr_bal.fit(tfidf_train, y_train)

LogisticRegression(class_weight='balanced', random_state=42)
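
With `class_weight='balanced'`, scikit-learn weights each class inversely to its frequency, using `n_samples / (n_classes * count)`. The sketch below applies that formula by hand to the training counts taken from the support column of the training report; the variable names are illustrative.

```python
# Training-set label counts, taken from the support column of the training report
train_counts = {"ham": 3859, "spam": 598}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# weight(c) = n_samples / (n_classes * count(c))
weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
# ham: 0.58
# spam: 3.73
```

Each misclassified spam therefore costs the model roughly six times more than a misclassified ham during training.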

In [40]:
y_pred_bal = logr_bal.predict(tfidf_test)

In [41]:
y_pred_train_bal = logr_bal.predict(tfidf_train)

In [42]:
print(classification_report(y_test, y_pred_bal))

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       966
        spam       0.90      0.92      0.91       149

    accuracy                           0.97      1115
   macro avg       0.94      0.95      0.95      1115
weighted avg       0.98      0.97      0.98      1115



In [43]:
print(classification_report(y_train, y_pred_train_bal))

              precision    recall  f1-score   support

         ham       1.00      0.99      0.99      3859
        spam       0.93      0.99      0.96       598

    accuracy                           0.99      4457
   macro avg       0.96      0.99      0.98      4457
weighted avg       0.99      0.99      0.99      4457



The results are good, but can we do better? We can try to **make the dataset less unbalanced**. We need to create new spams! The naive approach would be to duplicate the spams, but this may not work very well and may simply lead to overfitting.

Instead, **we will use the word embeddings to find synonyms**. With synonyms we can generate new spams without duplicating the texts, so it's a little smarter.

How can we find synonyms with word embeddings? If two words have embeddings with a very high cosine similarity, you can assume they are close in meaning and treat them as synonyms.
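
As a reminder, cosine similarity is the dot product of two vectors divided by the product of their norms. Here is a minimal sketch in plain Python; the three-dimensional "embeddings" are invented for illustration and do not come from any real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration
cash = [0.9, 0.1, 0.3]
money = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, 0.1]

print(cosine_similarity(cash, money))   # high: the vectors point the same way
print(cosine_similarity(cash, banana))  # near zero: the vectors are unrelated
```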

In the course we saw how to use the pre-trained GloVe model containing 400,000 words and their vector representations. The problem is that finding the closest word to a given word means computing 399,999 cosine similarities, which takes far too much time if we do it one pair at a time!

To avoid this, we will load a GloVe model through gensim, which performs this search much faster.
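
One way such a lookup can be made fast is to normalise all vectors once and compute every cosine similarity in a single matrix-vector product. The sketch below illustrates the idea with NumPy on a toy vocabulary. It is not gensim's actual implementation, and the words, vectors and the helper `nearest_word` are made up for the example.

```python
import numpy as np

# Hypothetical vocabulary and 3-dimensional embeddings, invented for illustration
words = ["cash", "money", "banana", "prize"]
vectors = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.4],
    [-0.2, 0.9, 0.1],
    [0.5, 0.1, 0.8],
])

# Normalise every row once so that a dot product equals the cosine similarity
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def nearest_word(query):
    """Return the vocabulary word closest to `query`, excluding the word itself."""
    idx = words.index(query)
    similarities = unit_vectors @ unit_vectors[idx]  # one matrix-vector product
    similarities[idx] = -np.inf                      # ignore the query word
    return words[int(np.argmax(similarities))]

print(nearest_word("cash"))  # money
```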

First of all, download the model with the gensim downloader API. The following snippet of code does just that.

In [19]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")

With `model.word_vec()` we can display the vector representation of a word. Try it with a few words: how many dimensions does each vector have in this model?

In [20]:
# Vocabulary in index order (gensim 3.x attribute; renamed `index_to_key` in gensim 4)
model.index2word

['the',
 ',',
 '.',
 'of',
 'to',
 'and',
 'in',
 'a',
 '"',
 "'s",
 'for',
 '-',
 'that',
 'on',
 'is',
 'was',
 'said',
 'with',
 'he',
 'as',
 'it',
 'by',
 'at',
 '(',
 ')',
 'from',
 'his',
 "''",
 '``',
 'an',
 'be',
 'has',
 'are',
 'have',
 'but',
 'were',
 'not',
 'this',
 'who',
 'they',
 'had',
 'i',
 'which',
 'will',
 'their',
 ':',
 'or',
 'its',
 'one',
 'after',
 'new',
 'been',
 'also',
 'we',
 'would',
 'two',
 'more',
 "'",
 'first',
 'about',
 'up',
 'when',
 'year',
 'there',
 'all',
 '--',
 'out',
 'she',
 'other',
 'people',
 "n't",
 'her',
 'percent',
 'than',
 'over',
 'into',
 'last',
 'some',
 'government',
 'time',
 '$',
 'you',
 'years',
 'if',
 'no',
 'world',
 'can',
 'three',
 'do',
 ';',
 'president',
 'only',
 'state',
 'million',
 'could',
 'us',
 'most',
 '_',
 'against',
 'u.s.',
 'so',
 'them',
 'what',
 'him',
 'united',
 'during',
 'before',
 'may',
 'since',
 'many',
 'while',
 'where',
 'states',
 'because',
 'now',
 'city',
 'made',
 'like',
 

In [21]:
# TODO : how many dimensions in the embedding ?
len(model.word_vec('ok'))

100

With `model.most_similar('word', topn=5)` we can find the 5 words that are the most similar (in terms of cosine similarity) to a given word. Try it with *house* and with *fox*. Is the result always relevant?

In [22]:
# TODO : 5 most similar words to "house"
model.most_similar('house', topn=5)

[('office', 0.7581614851951599),
 ('senate', 0.7204986214637756),
 ('room', 0.7149738669395447),
 ('houses', 0.6888046264648438),
 ('capitol', 0.6851760149002075)]

In [23]:
# TODO : most similar word to "fox" (top 1 only)
model.most_similar('fox')[0][0]

'abc'

Now we will generate the new spams. To simplify the task, we will replace only the nouns. Singular nouns can be identified by their POS-tag 'NN' with `nltk.pos_tag`.

This is the way to do it:
- isolate the tokenized spams in a variable
- add the POS-tag to all spam tokens
- replace each token with the top 1 most similar word if 2 conditions are met: the POS-tag == 'NN' and the token has an embedding. 

> ⚠️ Hint : to verify that a word has a vector representation we can use `model.vocab`. Example:

```python
"house" in model.vocab  # True
```

- finally, add the new spams to the dataset
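
The replacement step described above can be sketched as follows. To keep the example runnable without downloading anything, a small dictionary named `NEAREST` stands in for the embedding model; it is a made-up assumption, as are the helper `replace_nouns` and the example tokens.

```python
# Hypothetical stand-in for the embedding model: maps a word to its nearest neighbour.
# With the real model this lookup would be model.most_similar(word, topn=1)[0][0].
NEAREST = {"entry": "enter", "cash": "money", "prize": "prizes"}

def replace_nouns(tagged_tokens, nearest=NEAREST):
    """Swap each 'NN' token that has an embedding for its most similar word."""
    new_tokens = []
    for token, tag in tagged_tokens:
        if tag == "NN" and token in nearest:
            new_tokens.append(nearest[token])
        else:
            # Keep non-nouns and out-of-vocabulary words unchanged
            new_tokens.append(token)
    return new_tokens

# (token, POS-tag) pairs in the format returned by nltk.pos_tag; this example is made up
tagged = [("free", "JJ"), ("entry", "NN"), ("win", "VB"), ("cash", "NN"), ("tkts", "NN")]
print(replace_nouns(tagged))  # ['free', 'enter', 'win', 'money', 'tkts']
```

Note that `tkts` is tagged as a noun but is left untouched because the stand-in has no entry for it, which mirrors the "token has an embedding" condition.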

In [24]:
# TODO : isolate the tokenized spams in a variable
token_spams = messages[y=='spam']

In [25]:
token_spams

2       [free, entry, wkly, comp, win, fa, cup, final,...
5       [freemsg, hey, darling, week, word, back, like...
8       [winner, value, network, customer, select, rec...
9       [mobile, month, entitle, update, late, colour,...
11      [six, chance, win, cash, pound, txt, send, cos...
                              ...                        
5537    [want, explicit, sex, sec, ring, cost, gsex, p...
5540    [ask, chatlines, inclu, free, min, india, cust...
5547    [contract, mobile, mnths, late, motorola, noki...
5566    [reminder, get, pound, free, call, credit, det...
5567    [time, try, contact, pound, prize, claim, easy...
Name: Message, Length: 747, dtype: object

In [26]:
# TODO : add the POS-tag to all spam tokens
pos_tags = token_spams.apply(nltk.pos_tag)

In [27]:
pos_tags

2       [(free, JJ), (entry, NN), (wkly, VBD), (comp, ...
5       [(freemsg, NN), (hey, NN), (darling, VBG), (we...
8       [(winner, NN), (value, NN), (network, NN), (cu...
9       [(mobile, JJ), (month, NN), (entitle, RB), (up...
11      [(six, CD), (chance, NN), (win, VB), (cash, NN...
                              ...                        
5537    [(want, NN), (explicit, NN), (sex, NN), (sec, ...
5540    [(ask, NN), (chatlines, NNS), (inclu, VBP), (f...
5547    [(contract, NN), (mobile, JJ), (mnths, NNS), (...
5566    [(reminder, NN), (get, VB), (pound, NN), (free...
5567    [(time, NN), (try, VB), (contact, JJ), (pound,...
Name: Message, Length: 747, dtype: object

In [28]:
# TODO : replace token with the top 1 most similar word if 2 conditions are met:
# the POS-tag == 'NN' and the token has an embedding.
token_spams[2][0]

'free'

In [29]:
pos_tags[2][7]

('final', 'JJ')

In [30]:
new_spams = pos_tags.apply(lambda x : [model.most_similar(t[0], topn=1)[0][0] if t[1] == 'NN' and t[0] in model.vocab else t[0] for t in x])

In [31]:
new_spams

2       [free, enter, wkly, att, victory, fa, cup, fin...
5       [freemsg, yeah, darling, month, phrase, back, ...
8       [winners, price, networks, customers, selected...
9       [mobile, week, entitle, update, late, colour, ...
11      [six, chances, win, money, pounds, txt, send, ...
                              ...                        
5537    [do, implicit, sexual, ncaa, ring, costs, gsex...
5540    [asking, chatlines, inclu, free, kang, pakista...
5547    [contracts, mobile, mnths, late, motorola, nok...
5566    [reminders, get, pounds, free, calls, loans, d...
5567    [when, try, contact, pounds, nobel, claims, ea...
Name: Message, Length: 747, dtype: object

In [32]:
print(token_spams[2], new_spams[2])

['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', 'may', 'text', 'fa', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'apply'] ['free', 'enter', 'wkly', 'att', 'victory', 'fa', 'cup', 'final', 'checkout', 'may', 'text', 'fa', 'receive', 'enter', 'questions', 'std', 'txt', 'rates', 'applying']


In [49]:
spams_tk_full = pd.concat([token_spams, new_spams], axis=0).reset_index(drop=True)

In [33]:
# TODO : add the newly generated spams to the dataset

In [75]:
y_new = pd.Series(np.zeros((747)), name='Class')

In [76]:
y_new = y_new.apply(lambda x: 'spam')

In [77]:
y_new

0      spam
1      spam
2      spam
3      spam
4      spam
       ... 
742    spam
743    spam
744    spam
745    spam
746    spam
Name: Class, Length: 747, dtype: object

In [78]:
data_new = pd.concat([new_spams.reset_index(drop=True), y_new], axis=1)

In [79]:
data_new

| | Message | Class |
|---|---|---|
| 0 | [free, enter, wkly, att, victory, fa, cup, fin... | spam |
| 1 | [freemsg, yeah, darling, month, phrase, back, ... | spam |
| 2 | [winners, price, networks, customers, selected... | spam |
| 3 | [mobile, week, entitle, update, late, colour, ... | spam |
| 4 | [six, chances, win, money, pounds, txt, send, ... | spam |
| ... | ... | ... |
| 742 | [do, implicit, sexual, ncaa, ring, costs, gsex... | spam |
| 743 | [asking, chatlines, inclu, free, kang, pakista... | spam |
| 744 | [contracts, mobile, mnths, late, motorola, nok... | spam |
| 745 | [reminders, get, pounds, free, calls, loans, d... | spam |


In [80]:
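# Use the tokenized messages (not the raw strings in `data`) so that every row holds a token list
data_aug = pd.concat([pd.DataFrame({'Message': messages, 'Class': y}), data_new], axis=0).reset_index(drop=True)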

In [82]:
data_aug.Class.value_counts()

ham     4825
spam    1494
Name: Class, dtype: int64

In [84]:
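# NOTE: this selects the original `data` (raw strings), not `data_aug`, so the reports below describe
# the non-augmented messages; use data_aug.Message / data_aug.Class to evaluate the augmentation.
X_aug = data.Message
y_aug = data.Class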

In [85]:
# TODO : Split your data with the same random state as before and
# do a new prediction with the logistic regression and the same random state as before
X_train_aug, X_test_aug, y_train_aug, y_test_aug = train_test_split(X_aug, y_aug, test_size=.2, stratify=y_aug, random_state=42)
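
One caveat when augmenting before splitting: the synonym-replaced copy of a message can land in the training set while its original lands in the test set, which may inflate the test scores. A possible alternative is to split first and augment only the training spams. The sketch below shows this under simplifying assumptions: plain Python lists, no stratification, and a placeholder `fake_augment` standing in for the embedding-based replacement. All names and data in it are hypothetical.

```python
import random

def split_then_augment(samples, augment, test_ratio=0.2, seed=42):
    """Split (tokens, label) pairs, then add augmented spams to the training part only."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Augment only the training spams so that no variant of a test message leaks into training
    extra = [(augment(tokens), label) for tokens, label in train if label == "spam"]
    return train + extra, test

def fake_augment(tokens):
    """Placeholder augmentation: upper-case every token (stands in for synonym replacement)."""
    return [token.upper() for token in tokens]

# Toy data: 5 spams and 15 hams, already tokenized (hypothetical)
samples = [(["win", "cash", str(i)], "spam") for i in range(5)]
samples += [(["see", "you", str(i)], "ham") for i in range(15)]

train, test = split_then_augment(samples, fake_augment)
print(len(test))  # 4
```

With this ordering the test set contains only original messages, so the scores measure how well the model generalises to unseen texts.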

In [87]:
# TODO : Evaluate the new prediction on the minority class, is it better ?
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
tfidf_train_aug = vectorizer.fit_transform(X_train_aug).toarray()

tfidf_test_aug = vectorizer.transform(X_test_aug).toarray()

logr_aug = LogisticRegression(class_weight='balanced', random_state=42)
logr_aug.fit(tfidf_train_aug, y_train_aug)

y_pred_aug = logr_aug.predict(tfidf_test_aug)
y_pred_train_aug = logr_aug.predict(tfidf_train_aug)

In [88]:
print(classification_report(y_test_aug, y_pred_aug))

              precision    recall  f1-score   support

         ham       0.98      0.99      0.99       966
        spam       0.91      0.89      0.90       149

    accuracy                           0.97      1115
   macro avg       0.95      0.94      0.94      1115
weighted avg       0.97      0.97      0.97      1115



In [89]:
print(classification_report(y_train_aug, y_pred_train_aug))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99      3859
        spam       0.93      0.94      0.94       598

    accuracy                           0.98      4457
   macro avg       0.96      0.97      0.96      4457
weighted avg       0.98      0.98      0.98      4457

