# Identifying Spam

The goal of this exercise is to automatically classify an email as spam or as a normal email (ham).

We will go through the following steps:

1. Clean the data.
2. Convert the cleaned text into a numerical representation.
3. Apply a **Multinomial Naive Bayes** model to classify each email as spam or not spam.

## The NLTK Library (Natural Language Toolkit)

Install and/or import the NLTK library.

In [1]:
# !pip install nltk

In [2]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/camille/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/camille/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/camille/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/camille/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
import pandas as pd

df = pd.read_csv("./data/emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [4]:
df.tail()

Unnamed: 0,text,spam
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0
5727,Subject: news : aurora 5 . 2 update aurora ve...,0


## (1) Cleaning the Dataset Text

The dataset consists of emails labeled as ham [0] or spam [1]. We need to clean the data before training a prediction model.

### (1.1) Remove Punctuation

Create a function that removes punctuation. Apply it to the `text` column and store the output in a new dataframe column named `clean_text`.

**The output should look like this:**

<img src='./images/1.png'>

In [5]:
import string


def remove_punctuation(text):
    clean_text = text.translate(str.maketrans('', '', string.punctuation))
    return clean_text


df['clean_text'] = df['text'].apply(remove_punctuation)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
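
To see what the translation table does on a single string, here is a minimal sketch outside the dataframe. The sample sentence and the names `PUNCT_TABLE`, `stripped` and `normalized` are illustrative assumptions, not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

sample = "Subject: do not have money , get software cds !"
stripped = sample.translate(PUNCT_TABLE)

# A punctuation mark that sat between two spaces leaves a double space behind.
print(repr(stripped))

# Splitting on whitespace and re-joining collapses those runs.
normalized = ' '.join(stripped.split())
print(repr(normalized))
```

The leftover double spaces are harmless for the later steps, since tokenization splits on any run of whitespace.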


### (1.2) Lowercase

Create a function that converts the text to lowercase. We apply it to `clean_text`.

**The output should look like this:**

<img src="./images/2.png">

In [6]:
def to_lowercase(text):
    lowercase_text = text.lower()
    return lowercase_text

df['clean_text'] = df['clean_text'].apply(to_lowercase)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...


### (1.3) Remove Numbers

Create a function that removes numbers from the text. We apply it to `clean_text`.

**The output should look like this:**

<img src="./images/3.png">

In [7]:
def remove_numbers(text):
    clean_text = text.translate(str.maketrans('', '', '0123456789'))
    return clean_text

df['clean_text'] = df['clean_text'].apply(remove_numbers)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
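
The first three cleaning steps can also be chained in a single function. The sketch below is one possible way to do it, with a regular expression standing in for the digit translation table; the function name `clean` and the sample string are assumptions made for illustration.

```python
import re
import string

def clean(text):
    """Strip punctuation, lowercase, then drop ASCII digits (steps 1.1 to 1.3)."""
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    # [0-9] removes the same characters as translating away '0123456789'.
    text = re.sub(r'[0-9]', '', text)
    return text

sample = "Subject: 4 color printing, 100% FREE!"
cleaned = clean(sample)
print(repr(cleaned))
```

With a function like this, `df['text'].apply(clean)` would produce the same column as the three separate `apply` calls.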


### (1.4) Remove Stop Words

Create a function that removes stop words (common words that carry little meaning). We apply it to `clean_text`.


In [8]:
# Use these tools
from nltk.corpus import stopwords 
from nltk import word_tokenize


**The output should look like this:**

<img src="./images/4.png">

In [9]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = word_tokenize(text)
    clean_tokens = [word for word in tokens if word not in stop_words]
    return clean_tokens
    

df['clean_text'] = df['clean_text'].apply(remove_stopwords)

df.head()


Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, naturally, irresistible, corporate, ..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, stock, trading, gunslinger, fanny, m..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, unbelievable, new, homes, made, easy..."
3,Subject: 4 color printing special request add...,1,"[subject, color, printing, special, request, a..."
4,"Subject: do not have money , get software cds ...",1,"[subject, money, get, software, cds, software,..."
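
The filtering logic can be tried without NLTK on a toy example. In the sketch below, `TOY_STOP_WORDS` is a hypothetical hand-picked list and `str.split()` stands in for `word_tokenize`; both are simplifications for illustration, not what the notebook uses.

```python
# Hypothetical stop list for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"do", "not", "have", "the", "your", "don't"}

def remove_stopwords_simple(text, stop_words=TOY_STOP_WORDS):
    # Whitespace splitting is adequate here because punctuation
    # was already removed in step 1.1.
    return [word for word in text.split() if word not in stop_words]

first = remove_stopwords_simple("subject do not have money get software cds")
second = remove_stopwords_simple("dont miss the offer")
print(first)
print(second)
```

The second call shows a side effect of the step order: once punctuation has been stripped, a contraction stored with its apostrophe in the stop list ("don't") no longer matches the cleaned token ("dont"), so that token is kept.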


### (1.5) Lemmatize

Let's create a function that lemmatizes the text. Make sure the output is a single string, not a list of words. We apply it to `clean_text`.


In [10]:
# Use this tool
from nltk.stem import WordNetLemmatizer


**The output should look like this:**

<img src="./images/5.png">

In [11]:
lemmatizer = WordNetLemmatizer()

def lemmatize(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    lemmatized_string = ' '.join(lemmatized_tokens)
    return lemmatized_string


df['clean_text'] = df['clean_text'].apply(lemmatize)

df.head()


Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home made easy im wan...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...


## (2) CountVectorizer

### (2.1) Converting Text Data to Numbers

Vectorize `clean_text` with a `CountVectorizer`, leaving the parameters at their defaults.

In [12]:
# Use this tool
from sklearn.feature_extraction.text import CountVectorizer

**The output should look like this:**

<img src="./images/6.png">

In [13]:
vectorizer = CountVectorizer()

# fit_transform returns a sparse matrix; toarray() converts it to a dense array for display
X = pd.DataFrame(vectorizer.fit_transform(df['clean_text']).toarray())

X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30952,30953,30954,30955,30956,30957,30958,30959,30960,30961
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
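
To make the table above less opaque, here is a minimal bag-of-words sketch in plain Python. It assumes that the regular expression below mirrors `CountVectorizer`'s default token pattern (tokens of two or more word characters); the two toy documents and all names are illustrative.

```python
import re
from collections import Counter

docs = [
    "subject money get software cd",
    "subject meeting monday software update",
]

# Assumption: same pattern as CountVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

# Sorted vocabulary: one column per distinct token.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, holding the count of each vocabulary token.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one word of the vocabulary and each row to one email. With roughly 31,000 columns, almost every cell is zero, which is why `fit_transform` returns a sparse matrix by default.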


### (2.2) Multinomial Naive Bayes Modelling

Cross-validate the model and score it with the accuracy metric.

**You should get a score close to 1.**

In [14]:
# Use these tools
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate


In [15]:
y = df['spam']

clf = MultinomialNB()

cv_results = cross_validate(clf, X, y, cv=5, scoring='accuracy')

pd.DataFrame(cv_results)

Unnamed: 0,fit_time,score_time,test_score
0,4.0131,0.437912,0.986911
1,3.737859,0.300581,0.989529
2,2.774979,0.270533,0.991274
3,2.588649,0.291965,0.987773
4,2.642535,0.285681,0.99214
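
For intuition about what the classifier computes, the sketch below implements a small multinomial Naive Bayes from scratch with Laplace smoothing (`alpha=1.0`, the same default value as `MultinomialNB`). It is a simplified illustration on four made-up token lists, not scikit-learn's implementation; all function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

train_docs = [
    ["free", "money", "offer"],
    ["free", "software", "offer"],
    ["meeting", "monday", "report"],
    ["project", "report", "update"],
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham, as in the dataset

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
spam_guess = predict(["free", "offer"], log_prior, log_likelihood)
ham_guess = predict(["monday", "report"], log_prior, log_likelihood)
print(spam_guess, ham_guess)
```

The smoothing term keeps a word that never appeared in one class from driving that class's probability to zero, which matters with a vocabulary as large as the one built above.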


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer_tfidf = TfidfVectorizer()

X_tfidf = vectorizer_tfidf.fit_transform(df['clean_text'])

cv_tfidf_results = cross_validate(clf, X_tfidf, y, cv=5)

pd.DataFrame(cv_tfidf_results)

Unnamed: 0,fit_time,score_time,test_score
0,0.003509,0.000748,0.900524
1,0.003092,0.000751,0.906632
2,0.002954,0.000691,0.908377
3,0.003305,0.000694,0.894323
4,0.003065,0.00073,0.917031
