# Challenge - Text-Augmentation

![](https://images.unsplash.com/photo-1534770733765-337d273901c1?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1016&q=80)

Photo by [Franck V.](https://unsplash.com/photos/oIMXkEuiXpc)

The more data we have, the better performance we can achieve! It's easy with numerical data (see the lessons on Customer Churn and Anomaly Detection), but with texts it's a bit more complicated. We will see how to use word embeddings to do that.

First of all, let's go back to the spam classifier challenge of the 01-Processing-Text course. The aim is to improve your results of this exercice with text augmentation.

Remember, a spam classifier is a Machine Learning model that classifies texts (email or SMS) into two categories: spam (1) or ham (0).

To do that, we will reuse our knowledge: we will apply preprocessing and BOW or Tf-Idf on a dataset of texts.
Then we will use the logistic regression to predict to which class belong a new email/SMS, based on the BOW.

In [180]:
# TODO: Import NLTK and all the needed libraries
import gensim
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

Load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as loading option.

In [181]:
# TODO : load data
df = pd.read_csv('../input/spam.csv',on_bad_lines='skip')

Explore the dataset and check the balance of labels.

In [182]:
# TODO : how many spams and how many hams ?
df.head()
df.Class.value_counts()
## imbalanced dataset: 4825 hams, 747 spams

ham     4825
spam     747
Name: Class, dtype: int64

Only 747 spams for 4825 hams, the datasets is a quite **unbalanced**.

Before dealing with this problem, perform a classification using logistic regression and a BOW or Tf-Idf and compute the F1-score on the minority class (spam) with a classification report. 

> ⚠️ Hint : lemmatize your texts and set a random state for your classifier. 

In [183]:
# TODO : preprocessing
df['Class'] = df['Class'].replace({'ham': 0, 'spam': 1})
df.Class.value_counts()

0    4825
1     747
Name: Class, dtype: int64

In [184]:
# TODO : preprocessing
def preprocessing(document):
    # 1- tokenization
    tokens = word_tokenize(document)
    # 2- punctuation removal
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # 3- remove stopwords
    stop_words = stopwords.words('english')
    tokens = [t for t in tokens if not t in stop_words]
    # 4- lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

In [185]:
df["Message"] = df.Message.apply(preprocessing)
df.head()

Unnamed: 0,Class,Message
0,0,"[go, jurong, point, crazy, available, bugis, n..."
1,0,"[ok, lar, joking, wif, u, oni]"
2,1,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,0,"[u, dun, say, early, hor, u, c, already, say]"
4,0,"[nah, think, goes, usf, lives, around, though]"


In [186]:
# Build X and y
y = df['Class'].to_numpy()
X = df['Message']

In [187]:
# TODO : split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42,stratify=y)
print(f'X_train.shape: {X_train.shape}')
print(f'X_test.shape: {X_test.shape}')
print(f'y_train.shape: {y_train.shape}')
print(f'y_test.shape: {y_test.shape}')

X_train.shape: (3733,)
X_test.shape: (1839,)
y_train.shape: (3733,)
y_test.shape: (1839,)


In [188]:
# TODO : BOW or TFIDF
# TFIDF trained on train subset
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [189]:
## TEST: transform only on Test
tf_idf_test = vectorizer.transform(X_test).toarray()

In [190]:
# TODO : logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=100)
lr.fit(tf_idf_train, y_train)

# prédictions sur Train et Test
y_pred_train = lr.predict(tf_idf_train)
y_pred_test  = lr.predict(tf_idf_test)

In [191]:
# TODO : check the F1-score on the minority class
from sklearn.metrics import f1_score,recall_score,precision_score
# F1-score de Train sur ham et spam
vect_f1_score_train = f1_score(y_train, y_pred_train, average=None)
print(f" F1-score sur Train - Ham: {vect_f1_score_train[0]}")
print(f" F1-score sur Train - Spam: {vect_f1_score_train[1]}")
# F1-score de Test
vect_f1_score_test = f1_score(y_test, y_pred_test, average=None)
print(f" F1-score sur Test - Ham: {vect_f1_score_test[0]}")
print(f" F1-score sur Test - Spam: {vect_f1_score_test[1]}")
## F1-score sur Test - Spam: 0.7942583732057416


 F1-score sur Train - Ham: 0.9790718835304824
 F1-score sur Train - Spam: 0.841743119266055
 F1-score sur Test - Ham: 0.9736196319018404
 F1-score sur Test - Spam: 0.7942583732057416


The results are good, but can we do better ? We can try to **make the dataset less unbalanced**. We need to create new spams ! The naive approach would be to duplicate the spams, but this may not work very well and may simply generate overfitting. 

Instead, **we will use the word embeddings to find synonyms**. With synonyms we can generate new spams without duplicating the texts, so it's a little smarter.

How can we find synonyms with words embeddings ? If you have two words whose embeddings have a very high cosine similarity, you can assume they're synonymous. 

In the course we saw how to use the pre-trained Glove model containing 400000 words and their vector representation. The problem with this model is that if we have to find the closest word in the whole model we have to calculate 399999 consine similarity, which would take far too much time!

For this we will use another Glove model which allows us to do this much faster. 

First of all download the model from the Glove API. The following snippet of code does just that.

In [192]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")

In [193]:
model

<gensim.models.keyedvectors.KeyedVectors at 0x7fd3ee4903d0>

With `model.word_vec()` we can display the vector representation of a word. Try with some words, how many dimensions does each vector have in this model ? 

In [194]:
# TODO : how many dimensions in the embedding ?
# Then you can get the embedding of a word as a numpy array this way
word_embedding = model.word_vec('present')
word_embedding

  word_embedding = model.word_vec('present')


array([-0.19623 ,  0.26213 ,  0.46284 ,  0.23267 , -0.21188 ,  0.051022,
       -0.28305 ,  0.39192 ,  0.13012 ,  0.071752, -0.18406 ,  0.084196,
        0.34822 ,  0.18919 ,  0.62453 , -0.55918 ,  0.19213 , -0.017218,
       -0.55026 , -0.02437 , -0.34945 , -0.40632 ,  0.33808 , -0.043692,
       -0.3734  , -0.85703 ,  0.54926 ,  0.009388,  0.49106 , -0.17067 ,
        0.16493 ,  0.32655 ,  0.072014, -0.19438 ,  0.10654 ,  0.46349 ,
        0.61987 , -0.13063 , -0.20617 , -0.12397 , -0.47928 , -0.04128 ,
        0.34817 , -0.14162 ,  0.38493 ,  0.11754 ,  0.037054, -0.36628 ,
       -0.58683 , -0.64051 ,  0.42736 , -0.30208 ,  0.70955 ,  0.7739  ,
       -0.020319, -2.2776  , -0.39541 , -0.62158 ,  1.0597  ,  0.18122 ,
       -0.34483 ,  0.90861 , -0.15188 ,  0.0294  ,  0.79108 , -0.21084 ,
        0.1694  , -0.043356,  0.21957 , -0.36199 ,  0.30797 , -0.061459,
       -0.039168, -0.442   ,  0.40566 , -0.155   , -0.35937 , -0.28062 ,
       -0.79322 , -0.40449 , -0.23335 ,  0.33136 , 

With `model.most_similar('word', topn = 5)` we can find the 5 words that are the most similar (in terms of cosine similarity) to our given word. Try with with *house* and with *fox*. Is it always relevant?

In [195]:
# TODO : 5 most similar words to "house"
model.most_similar('house', topn = 5)

[('office', 0.7581615447998047),
 ('senate', 0.7204986810684204),
 ('room', 0.7149738669395447),
 ('houses', 0.6888046264648438),
 ('capitol', 0.6851760149002075)]

In [196]:
# TODO : 5 most similar words to "fox"
model.most_similar('fox', topn = 5)

[('abc', 0.7963388562202454),
 ('nbc', 0.7775211334228516),
 ('cbs', 0.7563193440437317),
 ('television', 0.7360690236091614),
 ('tv', 0.708899736404419)]

Now we will generate the new spams. To simplify the task, we will replace only the names. Names can be identified by their POS-tag 'NN' with `nltk.pos_tag`.

This is the way to do it:
- isolate the tokenized spams in a variable
- add the POS-tag to all spam tokens
- replace each token with the top 1 most similar word if 2 conditions are met: the POS-tag == 'NN' and the token has an embedding. 

> ⚠️ Hint : to verify that a word has a vector representation we can use `model.vocab`. 
<br>Example :

```python
"house" in model.vocab
>> True
```

- finaly, add new spams to the dataset 

### Data Leakage si on le fait avant le split train-test
### il faut en fait ajouter les new_spams au X_train uniquement: voir ce qui est fait dans EXO-2 Quora

In [197]:
# TODO : isolate the tokenized spams in a variable
tokenized_spams = df[df['Class']==1].Message
tokenized_spams 
##df.Class.value_counts()

2       [free, entry, wkly, comp, win, fa, cup, final,...
5       [freemsg, hey, darling, week, word, back, like...
8       [winner, valued, network, customer, selected, ...
9       [mobile, months, u, r, entitled, update, lates...
11      [six, chances, win, cash, pounds, txt, send, c...
                              ...                        
5537    [want, explicit, sex, secs, ring, costs, gsex,...
5540    [asked, chatlines, inclu, free, mins, india, c...
5547    [contract, mobile, mnths, latest, motorola, no...
5566    [reminder, get, pounds, free, call, credit, de...
5567    [time, tried, contact, u, pound, prize, claim,...
Name: Message, Length: 747, dtype: object

In [198]:
# TODO : add the POS-tag to all spam tokens
from nltk.tag import pos_tag

# list if list of tagged tokend
#pos_tag_spams = [pos_tag(spam) for spam in tokenized_spams]
#pos_tag_spams

pos_tag_spams = tokenized_spams.apply(nltk.pos_tag)
pos_tag_spams

2       [(free, JJ), (entry, NN), (wkly, VBD), (comp, ...
5       [(freemsg, NN), (hey, NN), (darling, VBG), (we...
8       [(winner, NN), (valued, VBN), (network, NN), (...
9       [(mobile, JJ), (months, NNS), (u, JJ), (r, NN)...
11      [(six, CD), (chances, NNS), (win, VBP), (cash,...
                              ...                        
5537    [(want, NN), (explicit, NN), (sex, NN), (secs,...
5540    [(asked, VBN), (chatlines, NNS), (inclu, RB), ...
5547    [(contract, NN), (mobile, JJ), (mnths, NNS), (...
5566    [(reminder, NN), (get, VB), (pounds, NNS), (fr...
5567    [(time, NN), (tried, VBN), (contact, NN), (u, ...
Name: Message, Length: 747, dtype: object

In [199]:
word_embedding = model.get_vector('present')
word_embedding

array([-0.19623 ,  0.26213 ,  0.46284 ,  0.23267 , -0.21188 ,  0.051022,
       -0.28305 ,  0.39192 ,  0.13012 ,  0.071752, -0.18406 ,  0.084196,
        0.34822 ,  0.18919 ,  0.62453 , -0.55918 ,  0.19213 , -0.017218,
       -0.55026 , -0.02437 , -0.34945 , -0.40632 ,  0.33808 , -0.043692,
       -0.3734  , -0.85703 ,  0.54926 ,  0.009388,  0.49106 , -0.17067 ,
        0.16493 ,  0.32655 ,  0.072014, -0.19438 ,  0.10654 ,  0.46349 ,
        0.61987 , -0.13063 , -0.20617 , -0.12397 , -0.47928 , -0.04128 ,
        0.34817 , -0.14162 ,  0.38493 ,  0.11754 ,  0.037054, -0.36628 ,
       -0.58683 , -0.64051 ,  0.42736 , -0.30208 ,  0.70955 ,  0.7739  ,
       -0.020319, -2.2776  , -0.39541 , -0.62158 ,  1.0597  ,  0.18122 ,
       -0.34483 ,  0.90861 , -0.15188 ,  0.0294  ,  0.79108 , -0.21084 ,
        0.1694  , -0.043356,  0.21957 , -0.36199 ,  0.30797 , -0.061459,
       -0.039168, -0.442   ,  0.40566 , -0.155   , -0.35937 , -0.28062 ,
       -0.79322 , -0.40449 , -0.23335 ,  0.33136 , 

In [200]:
# TODO : replace token with the top 1 most similar word if 2 conditions are met:
# the POS-tag == 'NN' and the token has an embedding.
## Pb avec l'error: AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
vocab = model.key_to_index
vocab
def replace_to_synonyme(tagged_tokens):
    ## issue with model.vocab
    tokens = [t_pos[0] if t_pos[1]!='NN' or t_pos[0] not in vocab
              else model.most_similar(t_pos[0], topn = 1)[0][0] for t_pos in tagged_tokens]
    return tokens

synonymes_tokens = pos_tag_spams.apply(replace_to_synonyme)

"\nsynonymes_tokens = [\n    [t_pos[0] if t_pos[1] != 'NN' else (model.most_similar(t_pos[0], topn = 1)[0][0], 'NN')\n     for t_pos in tokenized_spams_elt] for tokenized_spams_elt in tokenized_spams\n]\nsynonymes_tokens\n"

In [201]:
synonymes_tokens

2       [free, enter, wkly, att, victory, fa, cup, fin...
5       [freemsg, yeah, darling, month, phrase, back, ...
8       [winners, valued, networks, customers, selecte...
9       [mobile, months, u, d, entitled, update, lates...
11      [six, chances, win, money, pounds, txt, send, ...
                              ...                        
5537    [do, implicit, sexual, 1min, rings, costs, gse...
5540    [asked, chatlines, inclu, free, mins, india, h...
5547    [contracts, mobile, mnths, latest, nokia, eric...
5566    [reminders, get, pounds, free, calls, loans, d...
5567    [when, tried, contacts, u, pounds, nobel, clai...
Name: Message, Length: 747, dtype: object

In [202]:
# TODO : compare a spam with this new version
print(tokenized_spams.iloc[50])
print(synonymes_tokens.iloc[50])

['fancy',
 'shag',
 'txt',
 'xxuk',
 'suzy',
 'txts',
 'cost',
 'per',
 'msg',
 'tncs',
 'website',
 'x']

['fancy',
 'carpeting',
 'obj',
 'xxuk',
 'veronica',
 'txts',
 'costs',
 'per',
 'msg',
 'shgs',
 'web',
 'g']

#### AJOUTER les ROWS associées à synonymes_tokens et 0/1 dans y 

In [204]:
# TODO : add the newly generated spams to the dataset
X_more = X.append(pd.Series(synonymes_tokens), ignore_index=True)
X_more

  X_more = X.append(pd.Series(synonymes_tokens), ignore_index=True)


0       [go, jurong, point, crazy, available, bugis, n...
1                          [ok, lar, joking, wif, u, oni]
2       [free, entry, wkly, comp, win, fa, cup, final,...
3           [u, dun, say, early, hor, u, c, already, say]
4          [nah, think, goes, usf, lives, around, though]
                              ...                        
6314    [do, implicit, sexual, 1min, rings, costs, gse...
6315    [asked, chatlines, inclu, free, mins, india, h...
6316    [contracts, mobile, mnths, latest, nokia, eric...
6317    [reminders, get, pounds, free, calls, loans, d...
6318    [when, tried, contacts, u, pounds, nobel, clai...
Name: Message, Length: 6319, dtype: object

In [205]:
type(X_more)

pandas.core.series.Series

In [206]:
# TODO : add new labels to your `y` variable
## all class==1 as they are new spams
y_more = pd.Series(y)
y_more = y_more.append(pd.Series(np.ones(len(synonymes_tokens))), ignore_index=True)
y_more


  y_more = y_more.append(pd.Series(np.ones(len(synonymes_tokens))), ignore_index=True)


0       0.0
1       0.0
2       1.0
3       0.0
4       0.0
       ... 
6314    1.0
6315    1.0
6316    1.0
6317    1.0
6318    1.0
Length: 6319, dtype: float64

In [207]:
X_train, X_test, y_train, y_test = train_test_split(X_more, y_more, test_size=0.33, random_state=42,stratify=y_more)
print(f'X_train.shape: {X_train.shape}')
print(f'X_test.shape: {X_test.shape}')
print(f'y_train.shape: {y_train.shape}')
print(f'y_test.shape: {y_test.shape}')

X_train.shape: (4233,)
X_test.shape: (2086,)
y_train.shape: (4233,)
y_test.shape: (2086,)


In [208]:
# TODO : check the balance of your dataset. It should be a little less imablanced.
### pourquoi le ds serait moins imbalanced ? à voir
print(len(y_more[y_more==0]))
print(len(y_more[y_more==1]))
# OK better balanced 

4825
1494


In [209]:
X_train

1319                               [correct, work, today]
1983    [wnt, buy, bmw, car, urgently, vry, hv, shorta...
3840                         [howz, come, said, medicine]
5460    [december, mobile, entitled, update, latest, c...
2318                                    [way, office, da]
                              ...                        
3196    [poking, man, everyday, teach, canada, abi, sa...
1967       [even, cant, close, eyes, vava, playing, umma]
5326                                       [makes, happy]
2582    [free, tarot, texts, find, love, life, try, fr...
5584    [valued, customers, pleased, advising, followi...
Name: Message, Length: 4233, dtype: object

In [210]:
y_train

1319    0.0
1983    0.0
3840    0.0
5460    1.0
2318    0.0
       ... 
3196    0.0
1967    0.0
5326    0.0
2582    1.0
5584    1.0
Length: 4233, dtype: float64

In [214]:
#TF-IDF again
# TFIDF trained on train subset
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_test = vectorizer.transform(X_test).toarray()

print(tf_idf_train.shape)
print(y_train.shape)


(4233, 6278)
(4233,)


In [215]:
# TODO : Split your data with the same random state as before (fait ci-dessus) and
# do a new prediction with the logistic regression and the same random state as before
lr = LogisticRegression(max_iter=100)
lr.fit(tf_idf_train, y_train)

# prédictions sur Train et Test
y_pred_train = lr.predict(tf_idf_train)
y_pred_test  = lr.predict(tf_idf_test)

## F1-score test sur spams: 0.89

In [216]:
# TODO : Evaluate the new prediction on the minority class, is it better ?
# F1-score de Train sur ham et spam
vect_f1_score_train = f1_score(y_train, y_pred_train, average=None)
print(f" F1-score sur Train - Ham: {vect_f1_score_train[0]}")
print(f" F1-score sur Train - Spam: {vect_f1_score_train[1]}")
# F1-score de Test
vect_f1_score_test = f1_score(y_test, y_pred_test, average=None)
print(f" F1-score sur Test - Ham: {vect_f1_score_test[0]}")
print(f" F1-score sur Test - Spam: {vect_f1_score_test[1]}")
## F1-score sur Test - Spam: 0.7942583732057416

 F1-score sur Train - Ham: 0.9799270072992701
 F1-score sur Train - Spam: 0.9301587301587302
 F1-score sur Test - Ham: 0.969122592479364
 F1-score sur Test - Spam: 0.8879023307436182
