# Identifying SPAM

The goal of this exercise is to automatically classify an email as spam or as a normal email (ham).

We will go through the following steps:

1. Clean the data.
2. Convert the cleaned text into a numerical representation.
3. Apply a **Multinomial Naive Bayes** model to classify each email as spam or not spam.

## The NLTK library (Natural Language Toolkit)

Install and/or import the NLTK library.

In [1]:
!pip install nltk



In [2]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /Users/heahg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/heahg/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/heahg/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/heahg/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
import pandas as pd

df = pd.read_csv("./data/emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [4]:
df.tail()

Unnamed: 0,text,spam
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0
5727,Subject: news : aurora 5 . 2 update aurora ve...,0


## (1) Cleaning the dataset text

The dataset consists of emails labeled as ham [0] or spam [1]. We need to clean the data before training a prediction model.

### (1.1) Remove punctuation

Create a function that removes punctuation. Apply it to the `text` column and store the output in a new dataframe column named `clean_text`.


**The output should look like this:**

<img src='./images/1.png'>

In [5]:
import string

def delete_punctuation(text):
    # Iterating over a string yields characters, so we filter character by character.
    return ''.join([char for char in text if char not in string.punctuation])

df['clean_text'] = df['text'].apply(delete_punctuation)
df.head()


Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
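
As a possible alternative to the character-by-character filter above, the same result can be obtained with `str.translate`, which is usually faster on long texts. This is only a sketch: the names `PUNCT_TABLE` and `delete_punctuation_fast` are ours, not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def delete_punctuation_fast(text: str) -> str:
    """Remove ASCII punctuation in a single pass using str.translate."""
    return text.translate(PUNCT_TABLE)

sample = "Subject: do not have money , get software cds !"
cleaned = delete_punctuation_fast(sample)
print(repr(cleaned))
```

Both versions only delete the punctuation characters themselves, so the spaces that surrounded them remain in the text. Like `string.punctuation` itself, this covers ASCII punctuation only.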


### (1.2) Lower Case

Create a function that converts the text to lowercase. We apply it to `clean_text`.

**The output should look like this:**

<img src="./images/2.png">

In [6]:
def lower_case(text):
    return text.lower()

In [7]:
df["clean_text"] = df["clean_text"].apply(lower_case)
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...


### (1.3) Remove numbers

Create a function that removes numbers from the text. We apply it to `clean_text`.


**The output should look like this:**

<img src="./images/3.png">

In [8]:
def delete_number(text):
    # Filter character by character: every digit character is dropped.
    return "".join(char for char in text if not char.isdigit())

In [9]:
df["clean_text"] = df["clean_text"].apply(delete_number)
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
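
Removing the `4` from `subject 4 color` leaves two consecutive spaces behind. If that matters for a later step, one option is a regular expression that drops digits and then collapses whitespace. The function name `delete_numbers_and_squeeze` is hypothetical, and the whitespace squeezing is an extra step the notebook does not perform.

```python
import re

def delete_numbers_and_squeeze(text: str) -> str:
    """Drop digit characters, then collapse runs of whitespace into one space."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

sample = "subject 4 color printing 24 hours"
squeezed = delete_numbers_and_squeeze(sample)
print(squeezed)
```

Note that `\d` and `str.isdigit()` are not strictly identical: `isdigit()` also accepts characters such as superscript digits, which `\d` does not match. For plain email text the two should behave the same.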


### (1.4) Remove stop words

Create a function that removes stop words, i.e. very common words that carry little meaning on their own. We apply it to `clean_text`.


In [10]:
# Use these tools
from nltk.corpus import stopwords
from nltk import word_tokenize


**The output should look like this:**

<img src="./images/4.png">

In [11]:
def delete_stopwords(text):
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    clean_tokens = [token for token in tokens if token not in stop_words]
    # Returns a list of tokens; section (1.5) joins them back into a single string.
    return clean_tokens

In [12]:
df["clean_text"] = df["clean_text"].apply(delete_stopwords)
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, naturally, irresistible, corporate, ..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, stock, trading, gunslinger, fanny, m..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, unbelievable, new, homes, made, easy..."
3,Subject: 4 color printing special request add...,1,"[subject, color, printing, special, request, a..."
4,"Subject: do not have money , get software cds ...",1,"[subject, money, get, software, cds, software,..."
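
To make the filtering step concrete without depending on the NLTK downloads, here is a minimal sketch that uses a tiny, hand-written stop-word list and a plain whitespace split. `TOY_STOP_WORDS` and `delete_stopwords_toy` are hypothetical names used only for illustration; the notebook itself relies on NLTK's much larger English list and on `word_tokenize`.

```python
# Hypothetical, deliberately tiny stop-word list (NOT the NLTK list).
TOY_STOP_WORDS = frozenset({"do", "not", "have", "the", "is", "your"})

def delete_stopwords_toy(text, stop_words=TOY_STOP_WORDS):
    """Split on whitespace and keep only tokens that are not stop words."""
    tokens = text.split()
    return [token for token in tokens if token not in stop_words]

tokens = delete_stopwords_toy("subject do not have money get software cds")
print(tokens)
```

The stop-word collection is built once, outside the function, and stored as a set so that each membership test is a constant-time lookup. Keep in mind that dropping words such as "not" discards negation, which is usually acceptable for spam detection but can hurt tasks like sentiment analysis.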


### (1.5) Lemmatize

Let's create a function that lemmatizes the text. Make sure the output is a single string, not a list of words. We apply it to `clean_text`.


In [13]:
# Use this tool
from nltk.stem import WordNetLemmatizer

def words_lemmatize(text):
    # `text` is the list of tokens produced in (1.4); the result is one string.
    lemmatizer = WordNetLemmatizer()
    return " ".join([lemmatizer.lemmatize(word) for word in text])

df["clean_text"] = df["clean_text"].apply(words_lemmatize)
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home made easy im wan...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...


**The output should look like this:**

<img src="./images/5.png">

## (2) CountVectorizer

### (2.1) Converting text data into numbers

Vectorize `clean_text` with a `CountVectorizer`, leaving the parameters at their defaults.
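
To see what count vectorization produces, here is a simplified pure-Python illustration of the bag-of-words idea. It is not scikit-learn's implementation: the helper names `build_vocabulary` and `count_vectorize` are ours, and tokens are obtained with a plain whitespace split, whereas `CountVectorizer` by default lowercases the text and only keeps tokens of two or more alphanumeric characters.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vectorize(documents, vocabulary):
    """Return one row per document, one column per vocabulary token."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        # The vocabulary dict preserves insertion (alphabetical) order.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

documents = ["free money free offer", "meeting money report"]
vocabulary = build_vocabulary(documents)
matrix = count_vectorize(documents, vocabulary)
print(vocabulary)
print(matrix)
```

Each row counts how often every vocabulary word occurs in one document, which is the same kind of information as the `vocabulary_` mapping and the transformed matrix shown in this section.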

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(df["clean_text"])

vectorizer.vocabulary_



{'subject': 26408,
 'naturally': 18174,
 'irresistible': 13942,
 'corporate': 5887,
 'identity': 12887,
 'lt': 16105,
 'really': 22430,
 'hard': 11939,
 'recollect': 22546,
 'company': 5195,
 'market': 16588,
 'full': 10784,
 'suqgestions': 26646,
 'information': 13394,
 'isoverwhelminq': 13986,
 'good': 11399,
 'catchy': 4066,
 'logo': 15930,
 'stylish': 26389,
 'statlonery': 26070,
 'outstanding': 19551,
 'website': 29822,
 'make': 16383,
 'task': 27055,
 'much': 17861,
 'easier': 8223,
 'promise': 21609,
 'havinq': 12039,
 'ordered': 19339,
 'iogo': 13866,
 'automaticaily': 1980,
 'become': 2471,
 'world': 30316,
 'ieader': 12904,
 'isguite': 13966,
 'ciear': 4632,
 'without': 30199,
 'product': 21517,
 'effective': 8412,
 'business': 3606,
 'organization': 19374,
 'practicable': 21149,
 'aim': 666,
 'hotat': 12584,
 'nowadays': 18747,
 'marketing': 16595,
 'effort': 8425,
 'list': 15787,
 'clear': 4790,
 'benefit': 2588,
 'creativeness': 6097,
 'hand': 11869,
 'made': 16255,
 'orig

**The output should look like this:**

In [15]:

vector = vectorizer.transform(df["clean_text"])

pd.DataFrame(vector.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30952,30953,30954,30955,30956,30957,30958,30959,30960,30961
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<img src="./images/6.png">
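
Almost every cell in this matrix is zero, which is why `transform` returns a sparse matrix. Calling `.toarray()` is convenient for display but materializes every cell. A rough back-of-the-envelope estimate of the dense size, assuming 8 bytes per cell (64-bit integer counts, which is our assumption here):

```python
n_documents = 5728   # rows: indices 0..5727, as shown by df.tail()
n_features = 30962   # columns: indices 0..30961 in the output above
bytes_per_cell = 8   # assumption: 64-bit integer counts

dense_bytes = n_documents * n_features * bytes_per_cell
dense_gb = dense_bytes / 1e9
print(f"{dense_bytes:,} bytes, about {dense_gb:.2f} GB")
```

On a dataset of this size the dense copy still fits in memory on most machines, but for larger corpora it is safer to keep the sparse matrix and pass it directly to the model.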

### (2.2) Multinomial Naive Bayes Modelling

Cross-validate the model and score it with the accuracy metric.

**You should get a score close to 1.**
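
For intuition about what the model computes, here is a minimal pure-Python sketch of Multinomial Naive Bayes with Laplace smoothing on a hypothetical four-document corpus. It is a teaching aid under simplifying assumptions (whitespace tokenization, unseen tokens ignored at prediction time), not scikit-learn's implementation; the function names and the toy data are ours.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_doc_counts.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                tok: math.log((token_counts[label][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model

def predict(model, doc):
    """Return the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc.split():
            # Tokens never seen during training are ignored.
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = spam, 0 = ham.
docs = ["free money offer", "win free prize",
        "meeting agenda report", "project report update"]
labels = [1, 1, 0, 0]
model = train_multinomial_nb(docs, labels)
prediction_spam = predict(model, "free prize money")
prediction_ham = predict(model, "report meeting")
print(prediction_spam, prediction_ham)
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.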

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df["clean_text"])

y = df["spam"]

clf = MultinomialNB()

scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

f"Mean accuracy: {scores.mean()}"


'Mean accuracy: 0.9895252901681946'
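
In the cell above, the vectorizer is fitted on the whole dataset before cross-validation, so the vocabulary is built using emails that later serve as validation data. With plain word counts the effect is probably small, but a common way to avoid it is to wrap both steps in a pipeline so that each fold fits its own vectorizer. The sketch below uses a hypothetical toy corpus in place of `df["clean_text"]` and `df["spam"]`; its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 5 spam (1) and 5 ham (0) messages.
texts = [
    "free money offer now", "win free prize today", "cheap software offer",
    "free prize money win", "cheap money offer today",
    "meeting agenda report", "project report update", "research meeting notes",
    "project update meeting", "research report agenda",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# The vectorizer is re-fitted on the training part of every fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
print(scores.mean())
```

On the real data, the same pattern would presumably be `cross_val_score(pipeline, df["clean_text"], df["spam"], cv=5, scoring="accuracy")`.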