# 01 - Preprocessing

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png)

Picture from [Wikipedia](https://en.wikipedia.org/wiki/SpaCy)

In [79]:
# !pip install -U spacy  (already done from the terminal)

In [80]:
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

In [65]:
# Create the nlp object -- this object will parse the text and preprocess it automatically
nlp = spacy.load("en_core_web_md")

---

In this exercise, you will do some text preprocessing for a text classification task using the libraries NLTK and spaCy, in order to see the difference between both libraries.

We are going to work with the [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It contains around 5,574 text messages, each classified as spam or ham (not spam).

# 1. Load data

In [66]:
import numpy as np
import pandas as pd

We will start by loading the data into a Pandas DataFrame:
> ⚠️ Do not forget to set the argument `header` to **None** and to pass in the list of column names

In [67]:
columns_names = ['label', 'text']
# TODO:
df = pd.read_csv('../sms+spam+collection/SMSSpamCollection.csv',sep='\t',header=None)
df.head()

| | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
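
The hint above mentions passing the column names directly when loading. A minimal sketch of that variant, using a small in-memory sample instead of the real file (the two sample rows and the name `df_sample` are assumptions made for illustration):

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as the data file.
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

columns_names = ['label', 'text']

# header=None: the file has no header row.
# names=...: assign the column names at load time, so no rename is needed afterwards.
df_sample = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns_names)

print(df_sample.columns.tolist())
print(df_sample)
```

With the real file, the same `names=columns_names` argument could be added to the `pd.read_csv` call, which would make a separate renaming step unnecessary.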


In [68]:
# assign the column names
df = df.rename(columns={0:columns_names[0], 1:columns_names[1]})
df.columns
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


The column we are interested in is **label**: it indicates whether the text is spam or ham. We will encode it numerically a little further down (spam as 1, ham as 0).

Before starting our classification task, let's analyze the proportion of spams and hams:

> 🔦 Hint: you can use the Series method `.value_counts()` for that

In [69]:
# TODO:
df.label.value_counts()
# imbalanced dataset 4825 / 747

ham     4825
spam     747
Name: label, dtype: int64

In [70]:
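# Encode only the label column (a DataFrame-wide replace would also touch
# any text cell that is exactly 'ham' or 'spam')
df['label'] = df['label'].map({'ham': 0, 'spam': 1})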
df.label.value_counts()

0    4825
1     747
Name: label, dtype: int64

You can notice that the classes are not balanced. We will use this information later in order to choose a correct evaluation metric. (💡 It's not accuracy in this case)
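
To see why accuracy is misleading here, consider a trivial baseline that labels every message as ham. The sketch below uses only the class counts printed above; the baseline itself is a hypothetical illustration, not a model used in this exercise:

```python
# Class counts taken from the value_counts() output above.
n_ham, n_spam = 4825, 747

# Hypothetical baseline: predict "ham" (0) for every message.
true_negatives = n_ham     # every ham message is correctly labeled ham
false_negatives = n_spam   # every spam message is missed
true_positives = 0
false_positives = 0

accuracy = (true_positives + true_negatives) / (n_ham + n_spam)
spam_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:    {accuracy:.3f}")     # accuracy:    0.866
print(f"spam recall: {spam_recall:.3f}")  # spam recall: 0.000
```

The baseline reaches roughly 87% accuracy while detecting no spam at all, which is why a metric focused on the minority class is a better fit.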

# 2. Text preprocessing comparison

Now, let's do some preprocessing and compare NLTK with spaCy!

## 2.1. Preprocess with NLTK

You are probably already familiar with preprocessing in NLTK. Let's review the steps we usually follow when preprocessing text:
- Split the sentence into tokens (tokenization)
- Convert all tokens to lowercase
- Remove stop words, punctuation and digits
- Lemmatize (or stem) the tokens

To keep the comparison between the two libraries fair, we are not going to stem the tokens in this exercise, because spaCy does not provide a stemmer.

Using NLTK for preprocessing, write a function `preprocess_nltk` that takes a sentence as input and outputs the filtered sentence after the steps listed above:

In [71]:
def preprocess_nltk(sent):
    # 1- tokenization
    tokens = word_tokenize(sent)
    # 2- lowercase (only purely alphabetic tokens; other tokens are left unchanged)
    tokens = [t.lower() if t.isalpha() else t for t in tokens]
    # 3- remove stop words (punctuation and digits are NOT removed here)
    stop_words = stopwords.words('english')
    tokens = [t for t in tokens if t not in stop_words]
    # 4- stemming: reduce each token to a root form by stripping known suffixes
    #    (note: the instructions above ask for lemmatization rather than stemming)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) for w in tokens]
    # alternative step 4- lemmatization (disabled)
    # lemmatizer = WordNetLemmatizer()
    # tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

Using the method `.apply()` on the column **text** create a new column on the dataframe called **preprocessed_text_nltk**:

In [72]:
%%time
df['preprocessed_text_nltk'] = df.text.apply(preprocess_nltk)
df.head()

CPU times: user 2.04 s, sys: 58 ms, total: 2.1 s
Wall time: 2.41 s


| | label | text | preprocessed_text_nltk |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | [go, jurong, point, ,, crazi, .., avail, bugi,... |
| 1 | 0 | Ok lar... Joking wif u oni... | [ok, lar, ..., joke, wif, u, oni, ...] |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entri, 2, wkli, comp, win, fa, cup, fin... |
| 3 | 0 | U dun say so early hor... U c already then say... | [u, dun, say, earli, hor, ..., u, c, alreadi, ... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | [nah, n't, think, goe, usf, ,, live, around, t... |


## 2.2. Preprocess with spaCy

Now let's perform the text preprocessing with the spaCy library. Write a function `preprocess_spacy` that takes a sentence as input and outputs the filtered sentence:

❗️Do not forget to follow the same steps you followed with NLTK.
> Hint: with spaCy the function can be written in two lines! <br>
> Hint: you can just use the very light `English` model from spaCy

```python
from spacy.lang.en import English
```

In [73]:
from spacy.lang.en import English

In [74]:
nlp = English()

In [75]:
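def preprocess_spacy(sent):
    # Build a Doc by running the text through the nlp object
    # (the light English pipeline only tokenizes)
    doc = nlp(sent)
    # Keep the raw token texts: no lowercasing and no stop-word,
    # punctuation or digit filtering is applied here
    tokens = [t.text for t in doc]
    return tokens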
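
The function above only tokenizes. A sketch that also applies the filtering steps could rely on spaCy's lexical token attributes, as below. The names `nlp_blank` and `preprocess_spacy_filtered` are hypothetical, and the results in this notebook were not produced with this variant. The sketch lowercases tokens instead of lemmatizing them, on the assumption that a blank pipeline has no lemmatizer component (this is the case in spaCy v3 as far as I know):

```python
import spacy

# Blank English pipeline: tokenizer and lexical attributes only, no trained model.
nlp_blank = spacy.blank("en")


def preprocess_spacy_filtered(sent):
    """Hypothetical variant: lowercase tokens minus stop words, punctuation and digits."""
    return [
        t.lower_
        for t in nlp_blank(sent)
        if not (t.is_stop or t.is_punct or t.is_digit or t.is_space)
    ]


print(preprocess_spacy_filtered("Free entry in 2 a wkly comp!"))
```

If lemmas are wanted, one option is to load a trained pipeline such as `en_core_web_md` and use `t.lemma_` in place of `t.lower_`, at the cost of slower processing.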

Again, using the method `.apply()` on the column **text** create a new column on the dataframe called **preprocessed_text_spacy**:

In [76]:
%%time
df['preprocessed_text_spacy'] = df.text.apply(preprocess_spacy)
df.head()

CPU times: user 530 ms, sys: 27 ms, total: 557 ms
Wall time: 654 ms


| | label | text | preprocessed_text_nltk | preprocessed_text_spacy |
|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | [go, jurong, point, ,, crazi, .., avail, bugi,... | [Go, until, jurong, point, ,, crazy, .., Avail... |
| 1 | 0 | Ok lar... Joking wif u oni... | [ok, lar, ..., joke, wif, u, oni, ...] | [Ok, lar, ..., Joking, wif, u, oni, ...] |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entri, 2, wkli, comp, win, fa, cup, fin... | [Free, entry, in, 2, a, wkly, comp, to, win, F... |
| 3 | 0 | U dun say so early hor... U c already then say... | [u, dun, say, earli, hor, ..., u, c, alreadi, ... | [U, dun, say, so, early, hor, ..., U, c, alrea... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | [nah, n't, think, goe, usf, ,, live, around, t... | [Nah, I, do, n't, think, he, goes, to, usf, ,,... |


Can you see the **ease of use** and the **improvement in execution time** when using spaCy instead of NLTK?

# 3. Text Classification - Spam detection

Finally, let's compare the effectiveness of preprocessing with each library by doing some text classification.

In order to evaluate our models properly, we need to split the data into a training set and a test set. For this exercise we will use 25% of the data set for testing.
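
Because the classes are unbalanced, it can help to stratify the split so that both subsets keep the same spam proportion. The sketch below illustrates this on synthetic labels built from the class counts shown earlier; the `_demo` variables are placeholders, not the real messages:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the data set (0 = ham, 1 = spam).
y_demo = np.array([0] * 4825 + [1] * 747)
X_demo = np.arange(len(y_demo))  # placeholder features: just row indices

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(len(y_train_demo), len(y_test_demo))
print(f"spam share in train: {y_train_demo.mean():.4f}")
print(f"spam share in test:  {y_test_demo.mean():.4f}")
```

With `stratify`, the two spam shares should be nearly identical; without it they can drift apart by chance, which makes the test score less representative.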

### With NLTK preprocessing

In [77]:
# Build X, y
X = df['preprocessed_text_nltk']
y = df['label']

In [78]:
# TODO : split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42,stratify=y)
print(f'X_train.shape: {X_train.shape}')
print(f'X_test.shape: {X_test.shape}')
print(f'y_train.shape: {y_train.shape}')
print(f'y_test.shape: {y_test.shape}')

X_train.shape: (4179,)
X_test.shape: (1393,)
y_train.shape: (4179,)
y_test.shape: (1393,)


Choose a text vectorizer (you can use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) or [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn) and vectorize both preprocessed columns:

In [57]:
# TODO: Vectorize both preprocessed texts: TfidfVectorizer
# TFIDF trained on train subset
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_train

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.10511611, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.11670479, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [58]:
# Test set: transform only (the vectorizer is never fitted on test data)
tf_idf_test = vectorizer.transform(X_test).toarray()
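
Since the preprocessed columns already contain token lists, the vectorizer is given an analyzer that returns its input unchanged. The sketch below shows the same idea on a few hypothetical token lists (invented for illustration, not taken from the data set), using a named function in place of the lambda:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return the token list unchanged so the vectorizer skips its own tokenization."""
    return tokens


# Hypothetical pre-tokenized messages.
train_docs = [["free", "entry", "win"], ["ok", "see", "u"], ["win", "cash", "free"]]
test_docs = [["free", "lunch"]]  # "lunch" never appears in the training documents

vectorizer_demo = TfidfVectorizer(analyzer=identity)
X_train_tfidf = vectorizer_demo.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_tfidf = vectorizer_demo.transform(test_docs)        # reuses them; unseen tokens are ignored

print(sorted(vectorizer_demo.vocabulary_))
print(X_train_tfidf.shape, X_test_tfidf.shape)
```

A named function behaves like the lambda but can be pickled, which matters if the fitted vectorizer is saved to disk. Keeping the matrices sparse (no `.toarray()`) also tends to save memory on larger vocabularies.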

## 3.1. Predicting on text preprocessed with NLTK

Now let's define a model to train and do some predictions. Feel free to choose the model you like the most.

After choosing your model, train it on the dataset preprocessed with **NLTK**.

In [59]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=100)
lr.fit(tf_idf_train, y_train)


Now let's predict on the test set with the model trained on the **NLTK**-preprocessed text:

In [60]:
# predictions on the train and test sets
y_pred_train = lr.predict(tf_idf_train)
y_pred_test  = lr.predict(tf_idf_test)

Finally evaluate the performance of the model predictions, using the [F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) metric.

This metric is often used when we deal with unbalanced classes.
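
As a reminder, the F1-score of a class is the harmonic mean of its precision and recall. The sketch below computes it by hand from error counts; the counts are hypothetical and chosen only to make the arithmetic easy to follow:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its error counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for the spam class:
# 8 spam messages caught, 2 ham messages flagged by mistake, 4 spam messages missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

True negatives do not appear in the formula, so a large number of correctly classified ham messages cannot inflate the spam F1-score.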

In [61]:
# F1-score on the test set
from sklearn.metrics import f1_score

# average=None returns one score per class, ordered by label: index 0 = ham, index 1 = spam
vect_f1_score_test = f1_score(y_test, y_pred_test, average=None)
print(f" F1-score on test - ham: {vect_f1_score_test[0]}")
print(f" F1-score on test - spam: {vect_f1_score_test[1]}")

 F1-score on test - ham: 0.9804878048780488
 F1-score on test - spam: 0.8527607361963191


You can also evaluate performance on unbalanced classes using a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html):

In [None]:
# TODO
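
One possible way to approach this step is sketched below on a handful of hypothetical labels (the `_demo` lists are invented for illustration). In the notebook, the same call could be applied to `y_test` and `y_pred_test`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = ham, 1 = spam.
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, both in the order given by `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Reading the four cells separately shows what kind of error dominates: ham flagged as spam (FP) or spam that slips through (FN).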

##  3.2. Predicting on text preprocessed with spaCy

Now repeat the same steps above using the dataset preprocessed with spaCy and compare the results.

> The results will probably be close. Although you might get a slightly better result with NLTK, the objective of this exercise is to appreciate the **ease of use** and **time efficiency** of spaCy in comparison to NLTK.

In [62]:
X = df['preprocessed_text_spacy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42,stratify=y)
print(f'X_train.shape: {X_train.shape}')
print(f'X_test.shape: {X_test.shape}')
print(f'y_train.shape: {y_train.shape}')
print(f'y_test.shape: {y_test.shape}')

# TFIDF trained on train subset
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_train

# Test set: transform only (the vectorizer is never fitted on test data)
tf_idf_test = vectorizer.transform(X_test).toarray()

# Logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=100)
lr.fit(tf_idf_train, y_train)

# predictions on the train and test sets
y_pred_train = lr.predict(tf_idf_train)
y_pred_test  = lr.predict(tf_idf_test)

# average=None returns one score per class, ordered by label: index 0 = ham, index 1 = spam
vect_f1_score_test = f1_score(y_test, y_pred_test, average=None)
print(f" F1-score on test - ham: {vect_f1_score_test[0]}")
print(f" F1-score on test - spam: {vect_f1_score_test[1]}")

# spam F1 is better with the spaCy preprocessing: 0.88 vs 0.85

X_train.shape: (4179,)
X_test.shape: (1393,)
y_train.shape: (4179,)
y_test.shape: (1393,)
 F1-score on test - ham: 0.9840881272949816
 F1-score on test - spam: 0.8835820895522388


## 3.3. (Optional) Improve your models

You can try to improve your classifiers by doing [Cross-Validation for hyperparameter tuning](https://scikit-learn.org/stable/modules/grid_search.html#grid-search), using `GridSearchCV` or `RandomizedSearchCV` to select the best hyperparameters for your model.
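
A minimal sketch of such a search is shown below. It wraps the vectorizer and the classifier in a `Pipeline` so that the TF-IDF vocabulary is refitted inside each fold. The toy corpus, the parameter grid and the variable names are assumptions chosen for illustration, not tuned recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (NOT the SMS data): 1 = spam, 0 = ham.
spam_texts = ["win a free prize now", "free entry to win cash", "claim your free cash prize"]
ham_texts = ["are you coming home tonight", "see you at lunch tomorrow", "call me when you get home"]
texts = (spam_texts + ham_texts) * 5
labels = ([1] * len(spam_texts) + [0] * len(ham_texts)) * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

# scoring="f1" evaluates the positive class (label 1, spam here) on each fold.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `texts` and `labels` would be replaced by the training split only, so the test set stays untouched until the final evaluation.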

> 🔎 Hint: For datasets with unbalanced classes you might also use classifiers that output a probability for class 1 (in scikit-learn this is usually exposed through the method `.predict_proba`) and fine-tune the decision threshold.
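
The threshold idea from the hint can be sketched as follows. The probabilities are hypothetical values standing in for `model.predict_proba(X)[:, 1]`, and the candidate thresholds are arbitrary. In practice the threshold would be chosen on a validation split, not on the test set:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true labels (1 = spam) and predicted spam probabilities.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba_spam = np.array([0.05, 0.12, 0.22, 0.31, 0.37, 0.46, 0.15, 0.42, 0.63, 0.91])

candidate_thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
scores = []
for threshold in candidate_thresholds:
    # Predict spam whenever the probability reaches the threshold.
    y_pred_demo = (proba_spam >= threshold).astype(int)
    scores.append(f1_score(y_true_demo, y_pred_demo))

best_threshold = candidate_thresholds[int(np.argmax(scores))]

for threshold, score in zip(candidate_thresholds, scores):
    print(f"threshold={threshold:.1f}  F1={score:.3f}")
print(f"best threshold: {best_threshold}")
```

In this toy example a threshold of 0.4 gives a higher spam F1-score than the default 0.5, because one spam message has a predicted probability just below 0.5.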

In [None]:
# TODO