## <center style="color:blue;">**HamOrSpamAI**</center>

### <center>**Intelligent Anti-Spam System**</center>

This project aims to build an automatic email classifier that separates spam from non-spam (ham) messages using machine learning (ML) or deep learning (DL) techniques, with text preprocessing based on NLTK and TF-IDF.
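
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of one common textbook formulation: term frequency (count divided by document length) multiplied by the log of the inverse document frequency. The function name `tfidf` and the toy token lists are assumptions made for illustration. Library implementations differ in detail (scikit-learn's `TfidfVectorizer`, for instance, applies smoothing and L2 normalization by default), so their numbers will not match this sketch.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation: tf = count / document length,
    idf = log(N / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy, hand-written token lists (not taken from the dataset).
docs = [
    ["cheap", "software", "cheap"],
    ["meeting", "software"],
    ["cheap", "pills"],
]
scores = tfidf(docs)
```

In this toy corpus, `meeting` appears in a single document and therefore receives a higher weight in the second document than `software`, which appears in two.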

<br>

### <span style="color:green;">**Data Loading:**</span>

#### <span style="color:orange;">**1. Load the Dataset:**</span>

In [33]:
import pandas as pd

df = pd.read_csv("../data/clean/data.csv")

print(df.shape)

df.head()

(31148, 8)


Unnamed: 0,_c0,message_id,text,label,label_text,subject,message,date
0,0,33214,any software just for 15 $ - 99 $ understandin...,1,spam,any software just for 15 $ - 99 $,understanding oem software\nlead me not into t...,2005-06-18
1,1,11929,perspective on ferc regulatory action client c...,0,ham,perspective on ferc regulatory action client c...,"19 th , 2 : 00 pm edt\nperspective on ferc reg...",2001-06-19
2,2,19784,wanted to try ci 4 lis but thought it was way ...,1,spam,wanted to try ci 4 lis but thought it was way ...,viagra at $ 1 . 12 per dose\nready to boost yo...,2004-09-11
3,3,2209,"enron / hpl actuals for december 11 , 2000 tec...",0,ham,"enron / hpl actuals for december 11 , 2000",teco tap 30 . 000 / enron ; 120 . 000 / hpl ga...,2000-12-12
4,4,15880,looking for cheap high - quality software ? ro...,1,spam,looking for cheap high - quality software ? ro...,"water past also , burn , course . gave country...",2005-02-13
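
Because `label` and `label_text` appear to encode the same information, it can be worth checking that they agree, and how balanced the classes are, before any column is discarded. The sketch below runs on a small made-up frame, not the real dataset; the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame mimicking the two label columns shown above (values are made up).
toy = pd.DataFrame({
    "label": [1, 0, 1, 0, 1],
    "label_text": ["spam", "ham", "spam", "ham", "spam"],
})

# Each numeric label should map to exactly one text label.
labels_per_code = toy.groupby("label")["label_text"].nunique()
is_consistent = bool((labels_per_code == 1).all())

# Share of each class, useful for spotting class imbalance early.
class_share = toy["label"].value_counts(normalize=True)
```

Running the same two checks on `df` would show whether one column can safely stand in for the other.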


#### <span style="color:orange;">**2. Drop Irrelevant Columns:**</span>

In [34]:
# Keep only the columns needed for classification
df = df[["text", "label"]].copy()

df.head()

Unnamed: 0,text,label
0,any software just for 15 $ - 99 $ understandin...,1
1,perspective on ferc regulatory action client c...,0
2,wanted to try ci 4 lis but thought it was way ...,1
3,"enron / hpl actuals for december 11 , 2000 tec...",0
4,looking for cheap high - quality software ? ro...,1


<br>

### <span style="color:green;">**Data Preprocessing:**</span>

#### <span style="color:orange;">**1. Normalize the Dataset:**</span>

In [35]:
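# Lowercase with the vectorized .str accessor so missing values stay NaN
# (str(x) would turn them into the string "nan" and hide them from dropna())
df["text"] = df["text"].str.lower()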

df.head()

Unnamed: 0,text,label
0,any software just for 15 $ - 99 $ understandin...,1
1,perspective on ferc regulatory action client c...,0
2,wanted to try ci 4 lis but thought it was way ...,1
3,"enron / hpl actuals for december 11 , 2000 tec...",0
4,looking for cheap high - quality software ? ro...,1
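
One detail matters when lowercasing a column that may contain missing values: wrapping each value in `str(...)` turns a missing entry into the literal string `"nan"`, while the `.str` accessor leaves it missing. The sketch below contrasts the two on a made-up Series; the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column with one missing value (made-up data).
text = pd.Series(["Hello WORLD", float("nan"), "Free OFFER"])

# str(x) converts the missing value into the string "nan",
# so it no longer counts as missing afterwards.
via_apply = text.apply(lambda x: str(x).lower())

# The vectorized .str accessor leaves missing values untouched.
via_str = text.str.lower()

missing_after_apply = int(via_apply.isna().sum())
missing_after_str = int(via_str.isna().sum())
```

With the second approach, a later `dropna()` can still remove rows that have no text.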


#### <span style="color:orange;">**2. Clean the Dataset:**</span>

In [36]:
df = df.dropna()

df = df[df["text"].str.strip() != ""]

df = df.drop_duplicates()

print(df.shape)

df.head()

(28382, 2)


Unnamed: 0,text,label
0,any software just for 15 $ - 99 $ understandin...,1
1,perspective on ferc regulatory action client c...,0
2,wanted to try ci 4 lis but thought it was way ...,1
3,"enron / hpl actuals for december 11 , 2000 tec...",0
4,looking for cheap high - quality software ? ro...,1
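
To see how much each cleaning step contributes, the three operations can be wrapped in a small helper that records the row count after every step. This is a sketch on made-up data: `clean_with_report` is a hypothetical helper, and the final `reset_index` is an optional addition that the cell above does not perform.

```python
import pandas as pd

def clean_with_report(frame, text_col="text"):
    """Apply the three cleaning steps and record the row count after each."""
    report = {"start": len(frame)}
    # Step 1: remove rows with missing values.
    frame = frame.dropna()
    report["after_dropna"] = len(frame)
    # Step 2: remove rows whose text is empty or whitespace only.
    frame = frame[frame[text_col].str.strip() != ""]
    report["after_empty_filter"] = len(frame)
    # Step 3: remove exact duplicate rows.
    frame = frame.drop_duplicates()
    report["after_drop_duplicates"] = len(frame)
    return frame.reset_index(drop=True), report

# Toy frame (made-up data) with one missing, one blank, and one duplicated row.
toy = pd.DataFrame({
    "text": ["win money now", None, "   ", "win money now", "see you at 2 pm"],
    "label": [1, 0, 0, 1, 0],
})
cleaned, report = clean_with_report(toy)
```

On the toy frame, each step removes exactly one row, leaving two of the original five.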


#### <span style="color:orange;">**3. Tokenization:**</span>

In [37]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
nltk.download("punkt_tab")

df["sent_tokens"] = df["text"].apply(lambda x: sent_tokenize(x) if isinstance(x, str) else [])

df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


Unnamed: 0,text,label,sent_tokens
0,any software just for 15 $ - 99 $ understandin...,1,[any software just for 15 $ - 99 $ understandi...
1,perspective on ferc regulatory action client c...,0,[perspective on ferc regulatory action client ...
2,wanted to try ci 4 lis but thought it was way ...,1,[wanted to try ci 4 lis but thought it was way...
3,"enron / hpl actuals for december 11 , 2000 tec...",0,"[enron / hpl actuals for december 11 , 2000 te..."
4,looking for cheap high - quality software ? ro...,1,"[looking for cheap high - quality software ?, ..."


In [38]:
from nltk.tokenize import word_tokenize

df["word_tokens"] = df["text"].apply(lambda x: word_tokenize(x) if isinstance(x, str) else [])

df.head()

Unnamed: 0,text,label,sent_tokens,word_tokens
0,any software just for 15 $ - 99 $ understandin...,1,[any software just for 15 $ - 99 $ understandi...,"[any, software, just, for, 15, $, -, 99, $, un..."
1,perspective on ferc regulatory action client c...,0,[perspective on ferc regulatory action client ...,"[perspective, on, ferc, regulatory, action, cl..."
2,wanted to try ci 4 lis but thought it was way ...,1,[wanted to try ci 4 lis but thought it was way...,"[wanted, to, try, ci, 4, lis, but, thought, it..."
3,"enron / hpl actuals for december 11 , 2000 tec...",0,"[enron / hpl actuals for december 11 , 2000 te...","[enron, /, hpl, actuals, for, december, 11, ,,..."
4,looking for cheap high - quality software ? ro...,1,"[looking for cheap high - quality software ?, ...","[looking, for, cheap, high, -, quality, softwa..."
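
For quick experiments where the NLTK `punkt` data is not available, a regular expression can roughly approximate word-level tokenization. This is not how `word_tokenize` works internally, and results will differ on contractions, abbreviations, and similar cases. The name `simple_word_tokenize` is made up for this sketch.

```python
import re

# One token per run of word characters, or per single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens using a regex."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenize("any software just for 15 $ - 99 $")
# tokens == ['any', 'software', 'just', 'for', '15', '$', '-', '99', '$']
```

On this short, already space-separated input, the result matches the first tokens shown in the `word_tokens` column above.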


#### <span style="color:orange;">**4. Remove Stopwords:**</span>

In [39]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words('english'))

df["tokens"] = df["word_tokens"].apply(
    lambda tokens : [token for token in tokens if token not in stop_words]
)

df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,label,sent_tokens,word_tokens,tokens
0,any software just for 15 $ - 99 $ understandin...,1,[any software just for 15 $ - 99 $ understandi...,"[any, software, just, for, 15, $, -, 99, $, un...","[software, 15, $, -, 99, $, understanding, oem..."
1,perspective on ferc regulatory action client c...,0,[perspective on ferc regulatory action client ...,"[perspective, on, ferc, regulatory, action, cl...","[perspective, ferc, regulatory, action, client..."
2,wanted to try ci 4 lis but thought it was way ...,1,[wanted to try ci 4 lis but thought it was way...,"[wanted, to, try, ci, 4, lis, but, thought, it...","[wanted, try, ci, 4, lis, thought, way, expens..."
3,"enron / hpl actuals for december 11 , 2000 tec...",0,"[enron / hpl actuals for december 11 , 2000 te...","[enron, /, hpl, actuals, for, december, 11, ,,...","[enron, /, hpl, actuals, december, 11, ,, 2000..."
4,looking for cheap high - quality software ? ro...,1,"[looking for cheap high - quality software ?, ...","[looking, for, cheap, high, -, quality, softwa...","[looking, cheap, high, -, quality, software, ?..."
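
Stopword filtering is a plain set-membership test, so it is case-sensitive. NLTK's English stopword list is lowercase, which is one reason the text is lowercased before this step. The sketch below illustrates the effect with a tiny hand-picked stopword set (an assumption for illustration, not NLTK's full list) and a hypothetical helper `remove_stopwords`.

```python
# Tiny hand-picked stopword set for illustration only;
# the notebook uses NLTK's full English list instead.
stop_words = {"the", "is", "for", "to"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]

raw_tokens = ["The", "offer", "is", "for", "you"]

# Without lowercasing, "The" slips through the filter.
without_lowercasing = remove_stopwords(raw_tokens, stop_words)

# After lowercasing, "the" is matched and removed.
with_lowercasing = remove_stopwords([t.lower() for t in raw_tokens], stop_words)
```

Storing the stopwords in a `set` rather than a list also keeps each membership test at constant average cost, which matters over tens of thousands of emails.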


#### <span style="color:orange;">**5. Remove Punctuation & Special Characters:**</span>

In [41]:
import re

# Keep only purely alphabetic tokens; this drops punctuation, symbols, and numbers
df["clean_tokens"] = df["tokens"].apply(
    lambda tokens: [token for token in tokens if re.fullmatch(r"[a-z]+", token)]
)

df.head()

Unnamed: 0,text,label,sent_tokens,word_tokens,tokens,clean_tokens
0,any software just for 15 $ - 99 $ understandin...,1,[any software just for 15 $ - 99 $ understandi...,"[any, software, just, for, 15, $, -, 99, $, un...","[software, 15, $, -, 99, $, understanding, oem...","[software, understanding, oem, software, lead,..."
1,perspective on ferc regulatory action client c...,0,[perspective on ferc regulatory action client ...,"[perspective, on, ferc, regulatory, action, cl...","[perspective, ferc, regulatory, action, client...","[perspective, ferc, regulatory, action, client..."
2,wanted to try ci 4 lis but thought it was way ...,1,[wanted to try ci 4 lis but thought it was way...,"[wanted, to, try, ci, 4, lis, but, thought, it...","[wanted, try, ci, 4, lis, thought, way, expens...","[wanted, try, ci, lis, thought, way, expensive..."
3,"enron / hpl actuals for december 11 , 2000 tec...",0,"[enron / hpl actuals for december 11 , 2000 te...","[enron, /, hpl, actuals, for, december, 11, ,,...","[enron, /, hpl, actuals, december, 11, ,, 2000...","[enron, hpl, actuals, december, teco, tap, enr..."
4,looking for cheap high - quality software ? ro...,1,"[looking for cheap high - quality software ?, ...","[looking, for, cheap, high, -, quality, softwa...","[looking, cheap, high, -, quality, software, ?...","[looking, cheap, high, quality, software, rota..."
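
The filter in this step keeps only tokens made entirely of the lowercase letters `a` to `z`. Besides punctuation and symbols, that also discards numbers, hyphenated words, and words with accented characters. The sketch below shows this on made-up tokens and then joins the survivors back into one string, which is a common preparation if a later TF-IDF step expects strings rather than token lists (an assumption, since that step is not shown here). `keep_alpha_tokens` is a hypothetical helper name.

```python
import re

# Matches a token only if every character is a lowercase ASCII letter.
ALPHA_ONLY = re.compile(r"[a-z]+")

def keep_alpha_tokens(tokens):
    """Keep tokens made only of lowercase ASCII letters."""
    return [token for token in tokens if ALPHA_ONLY.fullmatch(token)]

# Made-up tokens covering the main cases the filter removes.
tokens = ["software", "15", "$", "e-mail", "2nd", "café", "quality", "?"]

clean_tokens = keep_alpha_tokens(tokens)
clean_text = " ".join(clean_tokens)
```

If tokens such as `e-mail` or prices carry signal for spam detection, a looser pattern may be worth comparing against this strict one.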
