## <center style="color:blue;">**HamOrSpamAI**</center>

### <center>**Intelligent Anti-Spam System**</center>

This project aims to build an automatic email classifier that separates spam from non-spam (ham) messages using machine learning (ML) or deep learning (DL) techniques, with text preprocessing based on NLTK and TF-IDF.

<br>

### <span style="color:green;">**Data Preprocessing:**</span>

#### <span style="color:orange;">**1. Load the Dataset:**</span>

In [28]:
import pandas as pd

df = pd.read_csv("../data/processed/data.csv")

df = df[["text","label"]]

print(df.shape)

df.head()

(28382, 2)


| | text | label |
|---|---|---|
| 0 | softwar understand oem softwar lead temptat fi... | 1 |
| 1 | perspect ferc regulatori action client conf ca... | 0 |
| 2 | want tri ci li thought way expens viagra per d... | 1 |
| 3 | enron hpl actual decemb teco tap enron hpl ga ... | 0 |
| 4 | look cheap high qualiti softwar rotat napoleon... | 1 |
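
The `text` values shown above look lowercased and stemmed (for example `softwar` and `qualiti`). The step that produced `data.csv` is not shown in this notebook, so the following is only a minimal sketch of the kind of NLTK-based cleaning that could yield such text. The `clean_text` helper and the tiny `STOPWORDS` set are hypothetical; a real pipeline would more likely rely on NLTK's full stopword corpus, which requires a separate download.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword list (assumption: the real
# preprocessing probably used a fuller list such as NLTK's stopwords corpus).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "for", "in", "this"}

stemmer = PorterStemmer()


def clean_text(raw: str) -> str:
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    kept = [stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS]
    return " ".join(kept)


cleaned = clean_text("Looking for cheap, high quality software?")
print(cleaned)
```

Using only the Porter stemmer and a regular expression keeps the sketch free of corpus downloads; swapping in a lemmatizer or a larger stopword list would change the resulting vocabulary.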


#### <span style="color:orange;">**2. Split the Data:**</span>

In [29]:
from sklearn.model_selection import train_test_split

df = df.dropna()

X = df["text"]
y = df["label"]

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"- X_Train : {len(X_train_text)}\t- X_Test : {len(X_test_text)}\n")
print(f"- y_Train : {len(y_train)}\t- y_Test : {len(y_test)}")

- X_Train : 22703	- X_Test : 5676

- y_Train : 22703	- y_Test : 5676
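
The split above is random but not stratified, so the spam ratio may differ slightly between the training and test subsets. Below is a minimal sketch, on made-up toy labels, of how the `stratify` argument of `train_test_split` keeps the class proportions aligned. The `toy` DataFrame and its 30% spam ratio are assumptions for illustration; the notebook does not establish whether stratification changes the results on this dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption: 30% spam, purely illustrative)
toy = pd.DataFrame({
    "text": [f"msg {i}" for i in range(100)],
    "label": [1] * 30 + [0] * 70,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],  # keep the spam/ham ratio equal in both subsets
)

print(f"train spam ratio: {y_tr.mean():.2f}")
print(f"test spam ratio:  {y_te.mean():.2f}")
```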


#### <span style="color:orange;">**3. Vectorize the Text with `TfidfVectorizer()`:**</span>

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=8000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

print(f"- X_Train : {X_train.shape}\t- X_Test : {X_test.shape}\n")

- X_Train : (22703, 8000)	- X_Test : (5676, 8000)



<br>

### <span style="color:green;">**Model Training:**</span>

#### <span style="color:orange;">**1. Defining the Models to Train:**</span>

In [31]:
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


models = {
    "Multinomial Naive Bayes" : MultinomialNB(),
    "Logistic Regression" : LogisticRegression(),
    "SVM" : LinearSVC(),
    "Random Forest" : RandomForestClassifier(),
    "Gradient Boosting" : GradientBoostingClassifier() 
}

results = []

for name, model in models.items():
    print(f"Training {name}: ")
    # Fit on the TF-IDF training matrix
    model.fit(X_train, y_train)

    # Predict on the held-out test set
    predicts = model.predict(X_test)

    # pos_label=1 makes label 1 (spam in the rows shown earlier) the positive class
    accuracy = accuracy_score(y_test, predicts)
    precision = precision_score(y_test, predicts, pos_label=1)
    recall = recall_score(y_test, predicts, pos_label=1)
    f1 = f1_score(y_test, predicts, pos_label=1)

    results.append([name, accuracy, precision, recall, f1])

Training Multinomial Naive Bayes: 
Training Logistic Regression: 
Training SVM: 
Training Random Forest: 
Training Gradient Boosting: 


#### <span style="color:orange;">**2. Comparing the Performance of the Trained Models:**</span>

In [32]:
results_df = pd.DataFrame(data=results, columns=["Model", "Accuracy", "Precision", "Recall", "F1 Score"])

results_df

| | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | 0.983968 | 0.978054 | 0.98854 | 0.983269 |
| 1 | Logistic Regression | 0.986786 | 0.977834 | 0.994824 | 0.986256 |
| 2 | SVM | 0.99031 | 0.987491 | 0.992237 | 0.989858 |
| 3 | Random Forest | 0.985201 | 0.979159 | 0.990018 | 0.984559 |
| 4 | Gradient Boosting | 0.94697 | 0.913912 | 0.981146 | 0.946336 |
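
For reference, the four metric columns above all derive from the confusion-matrix counts. Here is a small worked sketch on made-up labels (not the notebook's predictions) that computes each metric by hand and compares it with the scikit-learn functions used in the training loop.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration (1 = spam, 0 = ham)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # share of predicted spam that is really spam
recall = tp / (tp + fn)                     # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred, pos_label=1),
    recall_score(y_true, y_pred, pos_label=1),
    f1_score(y_true, y_pred, pos_label=1),
)
```

In a spam filter, low precision means legitimate mail is flagged as spam, while low recall means spam reaches the inbox, so the two columns are worth reading separately rather than only through the F1 score.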
