# Challenge - Text-Augmentation

![](https://images.unsplash.com/photo-1534770733765-337d273901c1?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1016&q=80)

Photo by [Franck V.](https://unsplash.com/photos/oIMXkEuiXpc)

The more data we have, the better performance we can achieve! Generating extra samples is easy with numerical data (see the lessons on Customer Churn and Anomaly Detection), but with text it's a bit more complicated. We will see how to use word embeddings to do that.

First of all, let's go back to the spam classifier challenge of the 01-Processing-Text course. The aim is to improve your results on this exercise with text augmentation.

Remember, a spam classifier is a Machine Learning model that classifies texts (email or SMS) into two categories: spam (1) or ham (0).

To do that, we will reuse what we already know: we will apply preprocessing and BOW or Tf-Idf to a dataset of texts.
Then we will use logistic regression to predict which class a new email/SMS belongs to, based on the BOW or Tf-Idf features.

In [None]:
# TODO: Import NLTK and all the needed libraries

Load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as loading option.

In [None]:
# TODO : load data

Explore the dataset and check the balance of labels.

In [None]:
# TODO : how many spams and how many hams ?

Only 747 spams for 4825 hams: the dataset is quite **unbalanced**.
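To put a number on the imbalance, here is a quick check that uses only the two counts quoted above. The variable names are illustrative and not part of the exercise.

```python
# Class counts quoted in the text above.
n_spam = 747
n_ham = 4825

total = n_spam + n_ham
spam_share = n_spam / total      # fraction of messages that are spam
ham_per_spam = n_ham / n_spam    # number of hams for each spam

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"hams per spam: {ham_per_spam:.1f}")
```

With roughly 13% of spams, a model that always answers "ham" would already be right about 87% of the time, which is why accuracy alone is not a useful metric here.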

Before dealing with this problem, perform a classification using logistic regression and a BOW or Tf-Idf and compute the F1-score on the minority class (spam) with a classification report. 

> ⚠️ Hint : lemmatize your texts and set a random state for your classifier. 
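As a reminder of what the F1-score on the minority class measures, the sketch below computes precision, recall and F1 for the spam label by hand. The labels and predictions are made-up toy values, and `f1_for_class` is a hypothetical helper. In the exercise itself, the same numbers appear in the row for label 1 of the classification report.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = f1_for_class(y_true, y_pred, positive=1)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the accuracy looks decent while half of the spams are missed, which is exactly what the F1-score on the spam class exposes.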

In [None]:
# TODO : preprocessing

In [None]:
# TODO : preprocessing

In [None]:
# TODO : split data

In [None]:
# TODO : BOW or TFIDF

In [None]:
# TODO : logistic regression

In [None]:
# TODO : check the F1-score on the minority class

The results are good, but can we do better? We can try to **make the dataset less unbalanced**. We need to create new spams! The naive approach would be to duplicate the spams, but this may not work very well and may simply lead to overfitting.

Instead, **we will use the word embeddings to find synonyms**. With synonyms we can generate new spams without duplicating the texts, so it's a little smarter.

How can we find synonyms with word embeddings? If two words have embeddings with a very high cosine similarity, you can assume, as a first approximation, that they are synonyms.
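For reference, here is a minimal sketch of cosine similarity in plain Python. The 3-dimensional vectors are invented for illustration and are not real GloVe embeddings; `cosine_similarity` and `toy_embeddings` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings", made up for this example.
toy_embeddings = {
    "house": [0.9, 0.1, 0.3],
    "home": [0.8, 0.2, 0.4],
    "fox": [0.1, 0.9, 0.2],
}

sim_house_home = cosine_similarity(toy_embeddings["house"], toy_embeddings["home"])
sim_house_fox = cosine_similarity(toy_embeddings["house"], toy_embeddings["fox"])

print(f"house vs home: {sim_house_home:.2f}")
print(f"house vs fox:  {sim_house_fox:.2f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score much lower.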

In the course we saw how to use the pre-trained GloVe model containing 400000 words and their vector representations. The problem with this model is that, to find the closest word in the whole model, we have to compute 399999 cosine similarities, which would take far too much time!

For this we will use another GloVe model which allows us to do this much faster.

First of all, download the model with the gensim downloader API. The following snippet of code does just that.

In [None]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")

With `model.word_vec()` (deprecated in gensim 4 and later in favour of `model.get_vector()`) we can display the vector representation of a word. Try it with some words: how many dimensions does each vector have in this model?

In [None]:
# TODO : how many dimensions in the embedding ?

With `model.most_similar('word', topn = 5)` we can find the 5 words that are the most similar (in terms of cosine similarity) to our given word. Try with *house* and with *fox*. Is the result always relevant?

In [None]:
# TODO : 5 most similar words to "house"

In [None]:
# TODO : 5 most similar words to "fox"

Now we will generate the new spams. To simplify the task, we will replace only the nouns. Nouns can be identified by their POS-tag 'NN' (singular or mass noun) with `nltk.pos_tag`.

This is the way to do it:
- isolate the tokenized spams in a variable
- add the POS-tag to all spam tokens
- replace each token with the top 1 most similar word if 2 conditions are met: the POS-tag == 'NN' and the token has an embedding. 

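> ⚠️ Hint : to verify that a word has a vector representation we can use `model.vocab` (in gensim 4 and later, this attribute was replaced by `model.key_to_index`).
<br>Example :

```python
"house" in model.vocab  # gensim 4 and later: "house" in model.key_to_index
>> True
```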

- finally, add the new spams to the dataset
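The replacement rule described in the steps above can be sketched on toy data, without NLTK or GloVe. Everything here is a hypothetical stand-in: `tagged_spam` mimics the `(token, tag)` pairs returned by `nltk.pos_tag`, and the dictionary `toy_neighbours` mimics the "has an embedding" check and the top 1 most similar word. With the real model, the lookup would presumably become `model.most_similar(token, topn=1)[0][0]`.

```python
# Hypothetical stand-ins so the sketch runs on its own.
tagged_spam = [
    ("win", "VB"), ("a", "DT"), ("free", "JJ"), ("phone", "NN"),
    ("or", "CC"), ("a", "DT"), ("xqzt", "NN"),
]
toy_neighbours = {"phone": "telephone", "free": "open"}


def augment(tagged_tokens, neighbours):
    """Replace a token only if it is tagged 'NN' AND has a known neighbour."""
    new_tokens = []
    for token, tag in tagged_tokens:
        if tag == "NN" and token in neighbours:
            new_tokens.append(neighbours[token])
        else:
            new_tokens.append(token)  # keep the original token unchanged
    return new_tokens


new_spam = augment(tagged_spam, toy_neighbours)
print(new_spam)
```

Here "free" is kept because it is not tagged 'NN', and "xqzt" is kept because it has no entry in the toy lookup, so only "phone" is replaced.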

In [None]:
# TODO : isolate the tokenized spams in a variable

In [None]:
# TODO : add the POS-tag to all spam tokens

In [None]:
# TODO : replace token with the top 1 most similar word if 2 conditions are met:
# the POS-tag == 'NN' and the token has an embedding.

In [None]:
# TODO : compare a spam with this new version

In [None]:
# TODO : add the newly generated spams to the dataset

In [None]:
# TODO : add new labels to your `y` variable

In [None]:
# TODO : check the balance of your dataset. It should be a little less imbalanced.

In [None]:
# TODO : Split your data with the same random state as before and
# do a new prediction with the logistic regression and the same random state as before

In [None]:
# TODO : Evaluate the new prediction on the minority class, is it better ?