# 04-Spam-Classifier

![](https://images.unsplash.com/photo-1534770733765-337d273901c1?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1016&q=80)

Photo by [Franck V.](https://unsplash.com/photos/oIMXkEuiXpc)

It's time to make our first real Machine Learning application of NLP: a spam classifier!

A spam classifier is a Machine Learning model that classifier texts (email or SMS) into two categories: Spam (1) or legitimate (0).

To do that, we will reuse our knowledge: we will apply preprocessing and BOW (Bag Of Words) on a dataset of texts.
Then we will use a classifier to predict to which class belong a new email/SMS, based on the BOW.

First things first: import the needed libraries.

In [91]:
# TODO: Import NLTK and all the needed libraries
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer


Load now the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as loading option.

In [92]:
# TODO: Load the dataset 
df = pd.read_csv('spam.csv')


As usual, I suggest you to explore a bit this dataset.

In [93]:
# TODO: explore the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [94]:
df.describe()

Unnamed: 0,Class,Message
count,5572,5572
unique,2,5170
top,ham,"Sorry, I'll call later"
freq,4825,30


In [95]:
# label : Class
df.Class.unique() # convert to 0/1
df.Class = df['Class'].apply(lambda x: int(x=='spam')) ## utiliser plutôt replace : à faire 
df.Class.unique()

array([0, 1])

In [96]:
df['Class'].value_counts()

0    4825
1     747
Name: Class, dtype: int64

In [97]:
def preprocessing(document):
    # 1- tokenization
    tokens = word_tokenize(document)
    # 2- punctuation removal
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # 3- remove stopwords
    stop_words = stopwords.words('english')
    tokens = [t for t in tokens if not t in stop_words]
    # 4- lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

So as you see we have a column containing the labels, and a column containing the text to classify.

We will begin by doing the usual preprocessing: tokenization, punctuation removal and lemmatization.

In [98]:
# TODO: Perform preprocessing over all the text
vectorizer = CountVectorizer(analyzer=lambda x: x)
## Gérer le risque de data leakage: faire le split train/test avant de faire le preproc NLP


Ok now we have our preprocessed data. Next step is to do a BOW.

In [99]:
# TODO: compute the BOW
BOW = vectorizer.fit_transform([preprocessing(x) for x in df.Message]).toarray()
strings = vectorizer.get_feature_names_out()

In [100]:
df.shape

(5572, 2)

Then make a new dataframe as usual to have a visual idea of the words used and their frequencies.

In [101]:
# TODO: Make a new dataframe with the BOW
# dataframe from BOW with cols as words of vectorizer
df_bow = pd.DataFrame(data=BOW, columns=strings)  ## être sûr du phasage entre les valeurs et les colonnes
df_bow



Unnamed: 0,aa,aah,aaniye,aaooooright,aathi,ab,abbey,abdomen,abeg,abel,...,zebra,zed,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's check what is the most used word in the spam category and the non spam category.

There are two steps: first add the class to the BOW dataframe. Second, filter on a class, sum all the values and print the most frequent one.

In [102]:
# TODO: print the most used word in the spam and non spam category
# spam case
print(f"BOW in spam case: {df_bow[df['Class'] == 1].sum(axis=0).sort_values(ascending=False)[:10]}")
## voir pourquoi le mot le + fréquent dans spam est 'call' et pas 'free'

# non spam case
print(f"BOW in spam case: {df_bow[df['Class'] == 0].sum(axis=0).sort_values(ascending=False)[:10]}")


BOW in spam case: call      346
free      219
txt       156
ur        144
u         144
mobile    123
text      121
stop      116
claim     113
reply     104
dtype: int64
BOW in spam case: u       1002
gt       318
lt       316
get      301
ok       262
go       251
got      243
ur       243
know     237
like     233
dtype: int64


You should find that the most frequent spam word is 'free', not so surprising, right?

Now we can make a classifier based on our BOW. We will use a simple logistic regression here for the example.

You're an expert, you know what to do, right? Split the data, train your model, predict and see the performance.

In [111]:
# TODO: Perform a logistic regression to predict whether a message is a spam or not
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score,recall_score,precision_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer, KNNImputer


In [105]:
y = df['Class'].to_numpy()
X = df_bow

In [106]:
# il faut penser à stratify, a fortiori pour des données imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42, stratify=y)

In [109]:
# normalisation sur train
X_train = StandardScaler().fit_transform(X_train)

# normalisation sur test
X_test = StandardScaler().fit_transform(X_test)

In [110]:
def scores_model(model, X_train, X_test, y_train, y_test):
    metrics = {}

    y_train_pred = model.predict(X_train)
    y_test_pred  = model.predict(X_test)

    ## F1-score
    metrics['f1score_train'] = f1score_train = f1_score(y_train, y_train_pred)
    metrics['f1score_test']  = f1score_test  = f1_score(y_test, y_test_pred)

    ## PRECISION
    metrics['precision_train'] = precision_train = precision_score(y_train,y_train_pred)
    metrics['precision_test']  = precision_test  = precision_score(y_test,y_test_pred)

    ## RECALL
    metrics['recall_train'] = recall_train = recall_score(y_train,y_train_pred)
    metrics['recall_test']  = recall_test  = recall_score(y_test,y_test_pred)

    return metrics

In [116]:
## je reviendrai + tard sur la partie pipeline
## car erreur: TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 
"""
steps=[
        ("imputer", KNNImputer()),
        ("scaler", MinMaxScaler()),
        ("dim_reducer", PCA(n_components=None))
    ]

pipe = make_pipeline(steps, LogisticRegression())
pipe.fit(X_train, y_train)
"""

param_dist = {
    'C': [0.1, 0.2],
     }

lr = LogisticRegression()
grid = RandomizedSearchCV(lr, param_dist, cv=3, n_iter=10, scoring = 'f1')
grid.fit(X_train, y_train)
print(grid.best_params_)

dict_scores_models = {}
dict_scores_lr = scores_model(grid, X_train, X_test, y_train, y_test)
dict_scores_models['LogisticRegression'] = dict_scores_lr
dict_scores_models

## scores très élevés: voir si dataleakage dans la partie NLP avec ntlk



{'C': 0.2}


{'LogisticRegression': {'f1score_train': 0.9991631799163181,
  'f1score_test': 0.8905109489051096,
  'precision_train': 1.0,
  'precision_test': 0.976,
  'recall_train': 0.9983277591973244,
  'recall_test': 0.8187919463087249}}

What precision do you get? Check by hand on some samples where it did predict well to check what could go wrong...

Try to use other models and try to improve your results.