# Identifying Spam

The goal of this exercise is to automatically identify whether an email is spam or a normal email.

We will go through the following steps:

1. Clean the data.
2. Convert the cleaned text into a numerical representation.
3. Apply a **Multinomial Naive Bayes** model to classify each email as spam or not spam.

## The NLTK library (Natural Language Toolkit)

Install and/or import the NLTK library.

In [1]:
# !pip install nltk

In [2]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /Users/Flo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/Flo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Flo/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /Users/Flo/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
import pandas as pd

df = pd.read_csv("./data/emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [4]:
df.tail()

Unnamed: 0,text,spam
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0
5727,Subject: news : aurora 5 . 2 update aurora ve...,0


## (1) Cleaning the dataset text

The dataset consists of emails labeled as ham [0] or spam [1]. We need to clean the data before training a prediction model.

### (1.1) Remove punctuation

Create a function that removes punctuation. Apply it to the `text` column and store the output in a new dataframe column named `clean_text`.


**The output should look like this:**

In [5]:
import string

def remove_punctuation(text):
    for punctuation in string.punctuation: 
        text = text.replace(punctuation, ' ') 
    return text

df['clean_text'] = df.text.apply(remove_punctuation)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request add...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds ...
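
The loop above calls `str.replace` once per punctuation character. As a possible alternative (a sketch, not part of the original notebook), `str.translate` with a table built by `str.maketrans` performs the same substitution in a single pass. The names `PUNCT_TO_SPACE` and `remove_punctuation_translate` are hypothetical.

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punctuation_translate(text):
    """Replace each punctuation character with a space in one pass."""
    return text.translate(PUNCT_TO_SPACE)

sample = "Subject: do not have money , get software cds !"
cleaned = remove_punctuation_translate(sample)
print(cleaned)
```

Both versions replace punctuation with a space rather than deleting it, so words separated only by punctuation (for example `e-mail`) are split into separate tokens instead of being glued together.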


### (1.2) Lowercase

Create a function that converts the text to lowercase. We apply it to `clean_text`.

**The output should look like this:**

In [6]:
def lowercase(text):
    lowercased = text.lower()
    return lowercased

df['clean_text'] = df.clean_text.apply(lowercase)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request add...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds ...


### (1.3) Remove numbers

Create a function that removes numbers from the text. We apply it to `clean_text`.


**The output should look like this:**

In [7]:
def remove_numbers(text):
    words_only = ''.join([i for i in text if not i.isdigit()])
    return words_only

df['clean_text'] = df.clean_text.apply(remove_numbers)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,subject color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds ...
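
Steps 1.1 to 1.3 each map a string to a string, so they can be chained. The sketch below combines them in one hypothetical helper, `basic_clean`, and runs it on a short made-up string to show the intermediate effect: replaced punctuation and removed digits leave extra spaces behind.

```python
import string

def basic_clean(text):
    """Apply steps 1.1 to 1.3 in order: punctuation, lowercase, digits."""
    # 1.1: replace each punctuation character with a space
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " ")
    # 1.2: lowercase
    text = text.lower()
    # 1.3: drop every digit character
    return "".join(ch for ch in text if not ch.isdigit())

raw = "Subject: 4 color printing!"
cleaned = basic_clean(raw)
print(repr(cleaned))

# Optional: collapse the leftover runs of whitespace
normalized = " ".join(cleaned.split())
print(repr(normalized))
```

The leftover whitespace is harmless in this notebook, because the tokenizer used in the next step splits on it anyway.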


### (1.4) Remove stopwords

Create a function that removes stopwords (words that carry little meaning on their own). We apply it to `clean_text`.


In [8]:
from nltk.corpus import stopwords 
from nltk import word_tokenize

stop_words = set(stopwords.words('english')) 

# Tokenize the text and drop English stopwords; returns a list of tokens
def remove_stopwords(text):
    tokenized = word_tokenize(text)
    without_stopwords = [word for word in tokenized if word not in stop_words]
    return without_stopwords

df['clean_text'] = df.clean_text.apply(remove_stopwords)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, naturally, irresistible, corporate, ..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, stock, trading, gunslinger, fanny, m..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, unbelievable, new, homes, made, easy..."
3,Subject: 4 color printing special request add...,1,"[subject, color, printing, special, request, a..."
4,"Subject: do not have money , get software cds ...",1,"[subject, money, get, software, cds, software,..."
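
To see what `remove_stopwords` does without downloading any NLTK data, here is a minimal sketch under two stated assumptions: a tiny hand-written stop-word set stands in for `stopwords.words('english')`, and `str.split` stands in for `word_tokenize`. The names `TOY_STOP_WORDS` and `remove_stopwords_toy` are hypothetical.

```python
# Assumption: a miniature stop-word set replacing NLTK's English list.
TOY_STOP_WORDS = {"do", "not", "have", "your", "the"}

def remove_stopwords_toy(text, stop_words=TOY_STOP_WORDS):
    """Split on whitespace and keep only tokens that are not stopwords."""
    tokens = text.split()  # assumption: stands in for word_tokenize
    return [token for token in tokens if token not in stop_words]

tokens = remove_stopwords_toy("subject do not have money get software cds")
print(tokens)
```

Note that the function returns a list of tokens, not a string, which is why the `clean_text` column shows bracketed lists after this step. Removing words such as "not" also discards negation, which is usually acceptable for a bag-of-words spam classifier but worth keeping in mind.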


### (1.5) Lemmatize

Create a function that lemmatizes the text. Make sure the output is a single string, not a list of words. We apply it to `clean_text`.


In [9]:
from nltk.stem import WordNetLemmatizer

def lemma(text):
    lemmatizer = WordNetLemmatizer() # Instantiate lemmatizer
    lemmatized = [lemmatizer.lemmatize(word) for word in text] # Lemmatize
    lemmatized_string = " ".join(lemmatized)
    return lemmatized_string

df['clean_text'] = df.clean_text.apply(lemma)

df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home made easy im wan...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...


**The output should look like this:**

<img src="./images/5.png">

## (2) CountVectorizer

### (2.1) Converting the text data to numbers

Vectorize `clean_text` with a `CountVectorizer`, leaving the parameters at their defaults.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X_bow = vectorizer.fit_transform(df.clean_text)

In [17]:
vectorized_texts = pd.DataFrame(
    X_bow.toarray(), 
    columns = vectorizer.get_feature_names_out(),
    index = df.clean_text
)

vectorized_texts

clean_text,aa,aaa,aaaenerfax,aadedeji,aagrawal,aal,aaldous,aaliyah,aall,aanalysis,...,zwzm,zxghlajf,zyban,zyc,zygoma,zymg,zzmacmac,zzn,zzncacst,zzzz
subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website make task much easier promise havinq ordered iogo company automaticaily become world ieader isguite ciear without good product effective business organization practicable aim hotat nowadays market promise marketing effort become much effective list clear benefit creativeness hand made original logo specially done reflect distinctive company image convenience logo stationery provided format easy use content management system letsyou change website content even structure promptness see logo draft within three business day affordability marketing break make gap budget satisfaction guaranteed provide unlimited amount change extra fee surethat love result collaboration look portfolio interested,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject stock trading gunslinger fanny merrill muzo colza attainder penultimate like esmark perspicuous ramble segovia group try slung kansa tanzania yes chameleon continuant clothesman libretto chesapeake tight waterway herald hawthorn like chisel morristown superior deoxyribonucleic clockwork try hall incredible mcdougall yes hepburn einsteinian earmark sapling boar duane plain palfrey inflexible like huzzah pepperoni bedtime nameable attire try edt chronography optimum yes pirogue diffusion albeit,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject unbelievable new home made easy im wanting show homeowner pre approved home loan fixed rate offer extended unconditionally credit way factor take advantage limited time opportunity ask visit website complete minute post approval form look foward hearing dorcas pittman,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject color printing special request additional information click click printable version order form pdf format phone fax e mail ramsey goldengraphix com request additional information click click printable version order form pdf format golden graphix printing azusa canyon rd irwindale ca e mail message advertisement solicitation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject money get software cd software compatibility great grow old along best yet tradgedies finish death comedy ended marriage,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
subject research development charge gpg forwarded shirley crenshaw hou ect vince j kaminski pm vera apodaca et enron enron cc vince j kaminski hou ect ect shirley crenshaw hou ect ect pinnamaneni krishnarao hou ect ect subject research development charge gpg vera shall talk accounting group correction vince pm vera apodaca enron vera apodaca enron vera apodaca enron pm pm pinnamaneni krishnarao hou ect ect cc vince j kaminski hou ect ect subject research development charge gpg per mail dated june kim watson supposed occurred true july fist six month reviewing july actuals able locate entry would pls let know whether entry made intend process thanks,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject receipt visit jim thanks invitation visit lsu shirley fedex receipt tomorrow vince james r garven pm vince j kaminski cc subject receipt visit dear vince thanks taking time visit faculty student got lot presentation favor ask concerning expense reimbursement process mail travel lodging receipt secretary joan payne following address joan payne department finance ceba louisiana state university baton rouge la thanks jim garven james r garven william h wright jr endowed chair financial service department finance ceba e j ourso college business administration louisiana state university baton rouge la voice fax e mail jgarven lsu edu home page http garven lsu edu vita http garven lsu edu dossier html research paper archive http garven lsu edu research html,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject enron case study update wow day super thank much vince coming baylor monday next week hash question list thanks john pm wrote good afternoon john want drop line update andy fastow confirmed one hour interview slot mr fastow monday december th noon addition schedule interview mr lay mr skilling outline question please hesitate contact regard cindy forwarded cindy derecskey corp enron pm cindy derecskey john martin cc vince j kaminski hou ect ect christie patrick hou ect ect subject enron case study document link cindy derecskey pm good afternoon john hope thing well writing update status meeting andy fastow ken lay jeff skilling arranged following meeting date time ken lay jeff skilling still trying work andy fastow schedule jeff skilling december th p ken lay december th p also attempt schedule meeting andy fastow december th convenience also allow u possibly schedule additional meeting th needed let know soon successful regard cindy derecskey university affair enron corp john martin carr p collins chair finance finance department baylor university po box waco tx office fax j martin baylor edu web http hsb baylor edu html martinj home html,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject interest david please call shirley crenshaw assistant extension set vince david p dupre pm vince j kaminski hou ect ect cc subject interest time available next day thanks david vince j kaminski pm david p dupre hou ect ect cc vince j kaminski hou ect ect subject interest david please stop chat minute vince david p dupre vince j kaminski hou ect ect cc subject interest may meet discus interest joining group strong quantitative discipline highly numerate thanks david forwarded david p dupre hou ect david p dupre hou ect ect cc subject interest vince kaminski,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### (2.2) Multinomial Naive Bayes Modelling

Cross-validate the model and score it using the accuracy metric.

**You should get a score close to 1.**

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

cv_nb = cross_validate(MultinomialNB(), X_bow, df.spam, scoring="accuracy")

cv_nb['test_score'].mean()

0.9895252901681946
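
In the cell above, `CountVectorizer` is fitted on the whole dataset before cross-validation, so the vocabulary is learned from emails that later serve as test folds. The effect on the score is probably small here, but one common way to keep vectorization inside each fold is a scikit-learn pipeline. The sketch below uses an invented toy corpus (`texts`, `labels`) in place of `df.clean_text` and `df.spam`, with `cv=2` only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus; in the notebook these would be df.clean_text and df.spam
texts = [
    "free money offer click",
    "cheap software offer click",
    "free software money",
    "click free offer money",
    "research meeting schedule vince",
    "enron research update vince",
    "meeting schedule update",
    "vince research meeting enron",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# The vectorizer is refitted on the training part of every fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv_pipeline = cross_validate(pipeline, texts, labels, cv=2, scoring="accuracy")
print(cv_pipeline["test_score"].mean())

# Fit on everything, then classify a new, already-cleaned email
pipeline.fit(texts, labels)
prediction = pipeline.predict(["free money offer"])
print(prediction)
```

A fitted pipeline also accepts raw strings directly at prediction time, which avoids having to keep the vectorizer and the model in sync by hand.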