# 01 - Preprocessing

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png)

Picture from [Wikipedia](https://en.wikipedia.org/wiki/SpaCy)

---

In this exercise, you will preprocess text for a text classification task using both NLTK and spaCy, in order to compare the two libraries.

We are going to work with the [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It contains 5,574 text messages, each labeled as spam or ham (not spam).

# 1. Load data

In [None]:
import numpy as np
import pandas as pd

We will start by loading the data into a pandas DataFrame:
> ⚠️ Do not forget to set the argument `header` to **None** and to pass in the list of column names

In [None]:
columns_names = ['label', 'text']
# TODO:
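
As a rough illustration of the two arguments mentioned in the warning, here is a minimal sketch that reads from an in-memory string instead of the real file. The tab separator, the toy messages, and the name `sample_df` are assumptions made for this example; the 0/1 labels follow the description in this notebook, and the raw file may encode them differently.

```python
import io

import pandas as pd

# Stand-in for the real file (assumption: tab-separated, no header row).
raw = "0\tOk lar... Joking wif u oni...\n1\tFree entry in 2 a wkly comp\n"

columns_names = ['label', 'text']
# header=None stops pandas from treating the first message as column names;
# names= supplies the column names instead.
sample_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=columns_names)
print(sample_df.shape)
```

Without `header=None`, the first message would be consumed as the header and silently dropped from the data.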

The column we are interested in is **label**: it indicates whether the text is spam (label == 1) or not (label == 0).

Before starting our classification task, let's analyze the proportion of spams and hams:

> 🔦 Hint: you can use the Series method `.value_counts()` for that

In [None]:
# TODO:
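
To show what `.value_counts()` returns, here is a small sketch on a toy Series. The label values are invented for illustration and do not reflect the real class proportions.

```python
import pandas as pd

# Toy labels standing in for the real column (1 = spam, 0 = ham).
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name='label')

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
print(counts)
print(shares)
```

Passing `normalize=True` gives proportions directly, which makes the imbalance easier to read than raw counts.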

Notice that the classes are imbalanced. We will use this information later to choose an appropriate evaluation metric. (💡 It's not accuracy in this case)

# 2. Text preprocessing comparison

Now, let's do some preprocessing and compare NLTK with spaCy!

## 2.1. Preprocess with NLTK

You are probably already familiar with preprocessing in NLTK. Let's review the steps we usually follow when preprocessing text:
- Split the sentence into tokens (also called tokenization)
- Transform all tokens to lowercase
- Remove stop words, punctuation and digits
- Lemmatize (or stem) the tokens

To keep the comparison between the two libraries fair, we are not going to stem the tokens in this exercise, because spaCy does not provide a stemmer.

Using NLTK for preprocessing, write a function `preprocess_nltk` that takes a sentence as input and outputs the filtered sentence after the steps listed above:

In [None]:
def preprocess_nltk(sent):
    # TODO
    pass

Using the method `.apply()` on the column **text**, create a new column in the DataFrame called **preprocessed_text_nltk**:

In [None]:
%%time
# TODO

## 2.2. Preprocess with spaCy

Now let's perform the text preprocessing with the spaCy library. Write a function `preprocess_spacy` that takes a sentence as input and outputs the filtered sentence:

❗️Do not forget to follow the same steps you followed with NLTK.
> Hint: with spaCy the function can be written in two lines! <br>
> Hint: you can just use the very light `English` model from spaCy

```python
from spacy.lang.en import English
```

In [None]:
def preprocess_spacy(sent):
    # TODO
    pass
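
The sketch below illustrates the token attributes that make a short spaCy implementation possible. It is only a partial illustration, not the expected solution: the helper name `filter_tokens_spacy` is hypothetical, and lemmatization is deliberately left out because, depending on the spaCy version, the blank `English` pipeline may not populate `token.lemma_` unless a lemmatizer component is added.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer and lexical attributes only, no trained model.
nlp = English()

def filter_tokens_spacy(sent):
    # Hypothetical helper: keeps lowercased tokens that are not stop words,
    # punctuation, or digits. Lemmatization is NOT performed here.
    doc = nlp(sent)
    return ' '.join(t.lower_ for t in doc
                    if not (t.is_stop or t.is_punct or t.is_digit))

print(filter_tokens_spacy("Free entry in 2 a wkly comp!"))
```

Creating `nlp` once outside the function avoids rebuilding the pipeline for every row when the function is used with `.apply()`.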

Again, using the method `.apply()` on the column **text**, create a new column in the DataFrame called **preprocessed_text_spacy**:

In [None]:
%%time
# TODO

Do you notice the **ease of use** and the **shorter execution time** when using spaCy instead of NLTK?

# 3. Text Classification - Spam detection

Finally, let's compare the effectiveness of preprocessing with each library by doing some text classification.

In order to evaluate our models properly, we need to split the data into a training set and a test set. For this exercise we will use 25% of the data set for testing.

In [1]:
# TODO : split data
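
One way such a split might look, sketched on a toy DataFrame. The frame `toy`, the `stratify` argument, and the `random_state` value are choices made for this example rather than requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (column names follow the exercise).
toy = pd.DataFrame({
    'text': [f'message {i}' for i in range(20)],
    'label': [1 if i % 5 == 0 else 0 for i in range(20)],  # 4 spam, 16 ham
})

# test_size=0.25 keeps 25% of the rows for testing; stratify preserves the
# spam/ham ratio in both parts; random_state (arbitrary) makes it repeatable.
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy['label'], random_state=42
)
print(len(train_df), len(test_df))
```

Splitting the whole DataFrame once means that both preprocessed columns end up with exactly the same training and test rows, which keeps the later comparison fair.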

Choose a text vectorizer (you can use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) or [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn) and vectorize both preprocessed columns:

In [None]:
# TODO: Vectorize both preprocessed texts
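
A minimal sketch of the fit/transform pattern with `TfidfVectorizer`, using invented toy texts in place of a preprocessed column. The same pattern would be repeated with a separate vectorizer for the other column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy preprocessed texts standing in for one preprocessed column.
train_texts = ['free entry prize', 'see later', 'win free cash', 'call later']
test_texts = ['free cash prize', 'unknownword later']

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training texts only...
X_train = vectorizer.fit_transform(train_texts)
# ...then reuse them for the test texts; words unseen in training are ignored.
X_test = vectorizer.transform(test_texts)
print(X_train.shape, X_test.shape)
```

Fitting on the training texts only prevents information from the test set leaking into the features.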

## 3.1. Predicting on text preprocessed with NLTK

Now let's define a model to train and make some predictions. Feel free to choose the model you like the most.

After choosing your model, train it on the dataset preprocessed with **NLTK**.

In [None]:
# TODO

Now let's predict on the test set with the model trained on the text preprocessed with **NLTK**:

In [None]:
# TODO
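
The train-then-predict flow can be sketched as follows. `MultinomialNB` is just one possible choice picked for this illustration, and the texts and labels are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the preprocessed training and test texts.
train_texts = ['free entry prize', 'see later', 'win free cash', 'call later']
train_labels = [1, 0, 1, 0]
test_texts = ['free prize', 'see later']

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = MultinomialNB()            # one possible choice; any classifier would do
model.fit(X_train, train_labels)   # train on the training features only
y_pred = model.predict(X_test)     # predict labels for the unseen test texts
print(y_pred)
```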

Finally, evaluate the performance of the model predictions using the [F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) metric.

This metric is often used when dealing with imbalanced classes.

In [None]:
# TODO
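
To see why accuracy can be misleading on imbalanced classes, consider this small sketch with invented labels, where the model misses one of only two spam messages.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 8 ham (0) and 2 spam (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # misses one of the two spam messages

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)   # the positive class defaults to label 1
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")   # accuracy=0.90, f1=0.67
```

Accuracy stays high because the many ham messages dominate it, while the F1-score drops because half of the spam was missed (precision 1.0, recall 0.5).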

You can also evaluate performance on imbalanced classes using a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html):

In [None]:
# TODO
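
A short sketch of how the confusion matrix breaks predictions down by error type, using invented labels.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 8 ham (0) and 2 spam (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes (sorted labels: 0, 1).
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # binary case: unpack the four cells
print(cm)
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")   # tn=8, fp=0, fn=1, tp=1
```

For spam detection, `fn` counts spam that slipped through and `fp` counts legitimate messages flagged as spam, so the matrix shows which kind of mistake the model makes.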

## 3.2. Predicting on text preprocessed with spaCy

Now repeat the same steps above using the dataset preprocessed with spaCy and compare the results.

> The results will probably be close. Although you might get a slightly better result with NLTK, the objective of this exercise is to appreciate the **ease of use** and **time efficiency** of spaCy in comparison to NLTK

In [None]:
# TODO

## 3.3. (Optional) Improve your models

You can try to improve your classifiers with [cross-validation for hyperparameter tuning](https://scikit-learn.org/stable/modules/grid_search.html#grid-search), using `GridSearchCV` or `RandomizedSearchCV` to select the best hyperparameters for your model.
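
A minimal sketch of a grid search over a vectorizer-plus-classifier pipeline. The toy texts, the choice of `MultinomialNB`, the parameter grid, and `cv=2` are all assumptions picked to keep the example small; a real search would use the training set and more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the preprocessed training texts and labels.
texts = ['free prize now', 'win cash prize', 'free cash win', 'claim free prize',
         'see you later', 'call me later', 'lunch at noon', 'meeting at noon']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer in the pipeline refits it inside every CV fold.
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
# Keys use the '<step name>__<parameter>' convention.
param_grid = {'clf__alpha': [0.1, 1.0], 'tfidf__ngram_range': [(1, 1), (1, 2)]}

search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Using `scoring='f1'` keeps the model selection aligned with the metric used for evaluation above.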

> 🔎 Hint: For datasets with imbalanced classes, you might also use classifiers that output a probability for class 1 (in scikit-learn this is usually exposed by the method `.predict_proba`) and fine-tune the decision threshold.

In [None]:
# TODO
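
Tuning the decision threshold on top of `.predict_proba` could look roughly like this. The synthetic dataset from `make_classification`, the `LogisticRegression` model, and the threshold grid are assumptions made for the sketch; in the exercise the features would be the vectorized texts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives, standing in for the vectorized texts.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]   # predicted probability of class 1

# Score each candidate threshold and keep the one with the highest F1.
thresholds = np.linspace(0.1, 0.9, 9)
scores = [f1_score(y_val, (proba >= t).astype(int), zero_division=0)
          for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(best_threshold, max(scores))
```

The threshold is chosen on held-out validation data here; picking it on the final test set would make the reported score too optimistic.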