# Identifying Spam

The goal of this exercise is to automatically classify an email as spam or as a normal email (ham).

We will go through the following steps:

1. Clean the data.
2. Convert the cleaned text into a numerical representation.
3. Apply a **Multinomial Naive Bayes** model to classify each email as spam or non-spam.

In [2]:
import string
import re


## The NLTK library (Natural Language Toolkit)

Install and/or import the NLTK library.

In [3]:
# !pip install nltk

In [4]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/laurianenathou/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/laurianenathou/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/laurianenathou/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/laurianenathou/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [5]:
import pandas as pd

df = pd.read_csv("./data/emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [6]:
df.tail()

Unnamed: 0,text,spam
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0
5727,Subject: news : aurora 5 . 2 update aurora ve...,0


## (1) Cleaning the dataset text

The dataset consists of emails labeled as ham [0] or spam [1]. We need to clean the data before training a prediction model.

### (1.1) Remove punctuation

Create a function that removes punctuation by replacing each punctuation character with a space. Apply it to the `text` column and store the output in a new dataframe column named `clean_text`.


In [7]:
def delete_punct(email):
    # Replace every punctuation character with a space.
    for punct in string.punctuation:
        email = email.replace(punct, " ")
    return email

In [8]:
df["clean_text"] = df["text"].apply(delete_punct)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request add...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds ...
...,...,...,...
5723,Subject: re : research and development charges...,0,Subject re research and development charges...
5724,"Subject: re : receipts from visit jim , than...",0,Subject re receipts from visit jim than...
5725,Subject: re : enron case study update wow ! a...,0,Subject re enron case study update wow a...
5726,"Subject: re : interest david , please , call...",0,Subject re interest david please call...


**The output should look like this:**

<img src='./images/1.png'>
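As a possible alternative to the character-by-character loop in `delete_punct`, the sketch below builds a translation table once with `str.maketrans` and applies it in a single pass. The name `delete_punct_translate` is hypothetical and not part of the notebook; the behavior is meant to match the loop, with each punctuation character becoming a space.

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
# It is built once, rather than on every call.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def delete_punct_translate(email: str) -> str:
    """Replace each punctuation character with a space (hypothetical helper)."""
    return email.translate(_PUNCT_TO_SPACE)


sample = "Subject: do not have money , get software cds !"
cleaned = delete_punct_translate(sample)
print(cleaned)
```

Because every punctuation character maps to exactly one space, the cleaned string keeps the same length as the input, which makes it easy to compare against the loop version.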

### (1.2) Lowercase

Create a function that converts the text to lowercase, and apply it to `clean_text`.

In [9]:
def lower_case(text):
    return text.lower()
    

In [10]:
df["clean_text"] = df["clean_text"].apply(lower_case)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request add...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds ...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim than...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow a...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call...


**The output should look like this:**

<img src="./images/2.png">
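For a step this simple, pandas also offers a vectorized string accessor, so the helper function and `.apply` are optional. The sketch below uses a small invented Series standing in for `df["clean_text"]`; the variable names are illustrative only.

```python
import pandas as pd

# Toy column standing in for df["clean_text"] (illustrative data only).
clean_text = pd.Series([
    "Subject  Naturally IRRESISTIBLE",
    "Subject  Re   Enron Case Study",
])

# Vectorized equivalent of .apply(lower_case).
lowered = clean_text.str.lower()
print(lowered.tolist())
```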

### (1.3) Remove numbers

Create a function that removes numbers from the text, and apply it to `clean_text`.


In [11]:
def remove_digits(email):
    # Strip every ASCII digit from the text.
    return re.sub(r'[0-9]', '', email)

In [12]:
df["clean_text"] = df["clean_text"].apply(remove_digits)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,subject color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds ...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim than...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow a...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call...


**The output should look like this:**

<img src="./images/3.png">
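Removing digits leaves runs of spaces where the numbers used to be. That is harmless for the later steps, since tokenization splits on whitespace, but if a tidier intermediate string is wanted, the sketch below collapses the gaps as well. The name `remove_digits_and_squeeze` is hypothetical and not part of the notebook.

```python
import re

# Patterns compiled once and reused for every email.
_DIGITS = re.compile(r"[0-9]+")
_SPACES = re.compile(r"\s+")


def remove_digits_and_squeeze(email: str) -> str:
    """Drop ASCII digits, then collapse whitespace runs (hypothetical helper)."""
    no_digits = _DIGITS.sub("", email)
    return _SPACES.sub(" ", no_digits).strip()


sample = "subject  4 color printing special request  2005"
result = remove_digits_and_squeeze(sample)
print(result)  # subject color printing special request
```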

### (1.4) Remove stop words

Create a function that removes stop words (common words that carry little meaning), and apply it to `clean_text`.


In [13]:
# Use these tools
from nltk.corpus import stopwords 
from nltk import word_tokenize


In [14]:
def delete_stopwords(email):
    # Tokenize the email and keep only the tokens that are not English stop words.
    # Note: this returns a list of tokens, not a string.
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(email)
    filtered_email = [w for w in word_tokens if w not in stop_words]
    return filtered_email

In [15]:
df["clean_text"] = df["clean_text"].apply(delete_stopwords)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, naturally, irresistible, corporate, ..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, stock, trading, gunslinger, fanny, m..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, unbelievable, new, homes, made, easy..."
3,Subject: 4 color printing special request add...,1,"[subject, color, printing, special, request, a..."
4,"Subject: do not have money , get software cds ...",1,"[subject, money, get, software, cds, software,..."
...,...,...,...
5723,Subject: re : research and development charges...,0,"[subject, research, development, charges, gpg,..."
5724,"Subject: re : receipts from visit jim , than...",0,"[subject, receipts, visit, jim, thanks, invita..."
5725,Subject: re : enron case study update wow ! a...,0,"[subject, enron, case, study, update, wow, day..."
5726,"Subject: re : interest david , please , call...",0,"[subject, interest, david, please, call, shirl..."


**The output should look like this:**

<img src="./images/4.png">
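The filtering logic of `delete_stopwords` can be illustrated without NLTK. The sketch below uses a tiny hand-written stop-word set and `str.split` in place of `stopwords.words('english')` and `word_tokenize`. Both substitutions are assumptions made to keep the example self-contained, and the name `filter_stop_words` is hypothetical.

```python
# Tiny illustrative stop-word set; NLTK's English list is far more complete.
TOY_STOP_WORDS = frozenset({"do", "not", "have", "the", "your", "a", "and"})


def filter_stop_words(email: str, stop_words=TOY_STOP_WORDS) -> list:
    """Return the whitespace-separated tokens that are not stop words."""
    # str.split() stands in for word_tokenize in this sketch.
    return [w for w in email.split() if w not in stop_words]


sample = "subject do not have money get software cds"
tokens = filter_stop_words(sample)
print(tokens)  # ['subject', 'money', 'get', 'software', 'cds']
```

Storing the stop words in a set (here a `frozenset`) keeps each membership test constant-time, which matters when the same check runs for every token of several thousand emails.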

### (1.5) Lemmatize

Let's create a function that lemmatizes the text. Make sure the output is a single string, not a list of words. We apply it to `clean_text`.


In [16]:
# Use this tool
from nltk.stem import WordNetLemmatizer

In [17]:
def lemmatiser(email):
    # `email` is a list of tokens; return the lemmatized tokens joined into one string.
    lemma = WordNetLemmatizer()
    return ' '.join(lemma.lemmatize(w) for w in email)

In [18]:
df["clean_text"] = df["clean_text"].apply(lemmatiser)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home made easy im wan...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject research development charge gpg forwar...
5724,"Subject: re : receipts from visit jim , than...",0,subject receipt visit jim thanks invitation vi...
5725,Subject: re : enron case study update wow ! a...,0,subject enron case study update wow day super ...
5726,"Subject: re : interest david , please , call...",0,subject interest david please call shirley cre...


**The output should look like this:**

<img src="./images/5.png">

## (2) CountVectorizer

### (2.1) Converting the text data into numbers

Vectorize `clean_text` with a `CountVectorizer`, leaving the parameters at their default values.

In [19]:
# Use this tool
from sklearn.feature_extraction.text import CountVectorizer

In [20]:
df.clean_text


0       subject naturally irresistible corporate ident...
1       subject stock trading gunslinger fanny merrill...
2       subject unbelievable new home made easy im wan...
3       subject color printing special request additio...
4       subject money get software cd software compati...
                              ...                        
5723    subject research development charge gpg forwar...
5724    subject receipt visit jim thanks invitation vi...
5725    subject enron case study update wow day super ...
5726    subject interest david please call shirley cre...
5727    subject news aurora update aurora version fast...
Name: clean_text, Length: 5728, dtype: object

In [21]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.clean_text)
df_X = pd.DataFrame.sparse.from_spmatrix(X, index=None, columns=None)
df_X


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30923,30924,30925,30926,30927,30928,30929,30930,30931,30932
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**The output should look like this:**

<img src="./images/6.png">

### (2.2) Multinomial Naive Bayes modelling

Cross-validate the model and score it with the accuracy metric.

**You should get a score close to 1.**

In [22]:
# Use these tools
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate


In [23]:
X = df_X

In [24]:
y = df["spam"]

In [25]:
clf = MultinomialNB(force_alpha=True)
clf.fit(X, y)

In [26]:
cv_result = cross_validate(clf, X, y, cv=3, scoring="accuracy")

In [27]:
cv_result['test_score']

array([0.98848168, 0.98847564, 0.99057098])
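In the cells above, the vectorizer is fitted on the full dataset before cross-validation, so the validation folds also contribute to the vocabulary. One common way to avoid this is to wrap both steps in a scikit-learn pipeline, so that the vocabulary is re-learned on the training folds only. The sketch below shows the idea on a small invented corpus. The texts and labels are assumptions for illustration, and the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: 6 spam-like and 6 ham-like cleaned texts.
toy_texts = [
    "free money offer click now",
    "win free prize money today",
    "cheap software offer click",
    "free prize click now win",
    "money offer cheap free software",
    "click win money prize free",
    "meeting schedule research update",
    "please call about project meeting",
    "research report schedule attached",
    "project update thanks for meeting",
    "call about research report",
    "schedule project call thanks",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer is refitted inside each training fold, then applied to the
# held-out fold, so no vocabulary information leaks from validation data.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_result = cross_validate(pipeline, toy_texts, toy_labels, cv=3, scoring="accuracy")
toy_scores = toy_result["test_score"]
print(toy_scores)
```

On the real data, the same pattern would take `df["clean_text"]` and `df["spam"]` in place of the toy lists.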