# Project: Spam Detection

- Date: Not marked - August 5 2025 

- Data: The data used in this project come from the SpamAssassin Public Corpus, available at https://spamassassin.apache.org/old/publiccorpus/

- Description: This project aims to predict whether an email is spam or ham. I will then give the model messages that I received myself and let it predict whether each one is spam or ham.

## Downloading and Cleaning of data

First, we load the data and transform it into a form we can use.
The following code cell was generated by ChatGPT, and I made some modifications to it, because I did not know how to load the data and this is not the main goal of the project.
I understand it better now.

In [24]:
import glob
from email import policy
from email.parser import BytesParser

def load_messages(root_folder="publiccorpus"):
    paths = glob.glob(f"{root_folder}/**/*", recursive=True)
    texts, labels = [], []

    print(paths[0])
    for path in paths:
        # Skip directories (open() raises IsADirectoryError on them)
        try:
            with open(path, 'rb') as f:
                msg = BytesParser(policy=policy.default).parse(f)
        except IsADirectoryError:
            continue

        # Extract the body (plain text preferred, HTML as fallback)
        body = msg.get_body(preferencelist=('plain', 'html'))
        raw_text = ""
        if body:
            try:
                # Standard method (may raise LookupError for an unknown charset)
                raw_text = body.get_content()
            except LookupError:
                # Fallback: decode the payload manually
                payload = body.get_payload(decode=True) or b""
                try:
                    raw_text = payload.decode('utf-8', errors='replace')
                except (UnicodeDecodeError, LookupError):
                    raw_text = payload.decode('latin-1', errors='replace')

        texts.append(raw_text)
        labels.append(0 if "ham" in path.lower() else 1)

    return texts, labels

if __name__ == "__main__":
    texts, labels = load_messages("/home/christian/ProjetsPerso/IA/MachineLearning/Spam_Detection/data")
    print(f"Downloaded: {len(texts)} messages ({sum(labels)} spam, {len(labels)-sum(labels)} ham)")


/home/christian/ProjetsPerso/IA/MachineLearning/Spam_Detection/data/20030228_easy_ham_2
Downloaded: 6552 messages (2399 spam, 4153 ham)
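
One caveat about the labelling rule in `load_messages`: it checks whether "ham" occurs anywhere in the full path, so a parent directory whose name happens to contain "ham" would label every message as ham. Below is a minimal sketch of a stricter rule that looks only at the folder directly containing the message. The helper name `label_from_path` and the example paths are hypothetical, and the sketch assumes the corpus keeps folder names such as `easy_ham_2` or `spam_2`.

```python
from pathlib import Path

def label_from_path(path):
    """Return 0 (ham) or 1 (spam), looking only at the message's parent folder name."""
    folder = Path(path).parent.name.lower()
    return 0 if "ham" in folder else 1

# Hypothetical paths: the user name "graham" no longer influences the label.
print(label_from_path("/home/graham/data/spam_2/00001.msg"))      # 1
print(label_from_path("/home/graham/data/easy_ham_2/00001.msg"))  # 0
```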


In [25]:
#Display of some text
print(texts[0:1])
print(type(texts))

['Quoting Niall O Broin <niall@linux.ie>:\n\n> I\'m installing warm standby disks on a number of boxes. These disks will be\n> the same size (sometimes bigger) than the main disk. The idea is that every\n> night I\'ll rsync the partitions on the main disk to the standby disk so\n> that\n> in the case of disaster, the first port of call, before the tapes, is the\n> standby disk. (We did consider running Linux md RAID on the disks but RAID\n> gives you no protection against slips of the finger)\n\nDo I get beaten round the head for saying "floppy"?\nAssuming the machines are networked, let each one send a copy of its kernel to\nthe others.  If the drives are open-the-box-and-switch-cables, then you can\nstart dd\'ing a floppy before you start.  If the drives are in drawers, then this\nmight slow you down by all of 60 seconds.\n\nAlternatively, you could use netboot.  No, I\'m serious.  Set the boot sequence\nto first hard disk then network.  Do NOT make any partition on the standby\nacti

In [26]:
#Cleaning part:

import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
import re

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/christian/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/christian/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In this phase, we preprocess the text.

This consists of removing all email addresses, URLs, and special characters (:, @, !, ?, ...) so that only letters and numbers remain, then converting all characters to lowercase. We then split the text into tokens, delete the stopwords (very common words with little meaning on their own, such as "the", "of", "is", "a", "but", ...), and reduce each remaining token to its stem.

In [27]:
def preprocess_text_nltk(text):
    # Very common English words ("the", "is", "of", ...) that are not useful for our task
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()

    # Remove email addresses, then URLs, then every character that is not
    # an ASCII letter, a digit, or whitespace
    text = re.sub(r'\S+@\S+', ' ', text)
    text = re.sub(r'http\S+|www\.\S+', ' ', text)
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Remove punctuation (a safety net: the regex above already replaced it with spaces)
    text = ''.join([char for char in text if char not in string.punctuation])

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [w for w in tokens if w not in stop_words]

    # Stemming (reduces the number of distinct tokens by mapping variants of a word to the same stem)
    tokens = [stemmer.stem(w) for w in tokens]

    return ' '.join(tokens)
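
One side effect of the character filter `[^A-Za-z0-9\s]` is that accented letters are replaced by spaces, so a French word such as "légitime" is cut into "l" and "gitime". The sketch below shows one possible way to avoid this, by stripping accents with the standard `unicodedata` module before filtering. The helper names `strip_accents` and `ascii_only` are hypothetical and are not part of the notebook's pipeline.

```python
import re
import unicodedata

def strip_accents(text):
    """Drop combining accent marks, keeping the base letter where Unicode defines a decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ascii_only(text):
    # Same character filter as in preprocess_text_nltk
    return re.sub(r'[^A-Za-z0-9\s]', ' ', text)

print(ascii_only("email légitime"))                 # the accented letter becomes a space
print(ascii_only(strip_accents("email légitime")))  # the word stays in one piece
```

Whether this is worth adding depends on how many non-English messages the model is expected to see.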


We now transform the text into a numeric vector that our models can use. Each word gets a score called TF-IDF, which gives more weight to words that appear often in a given text but rarely across the whole collection of texts.

This is a smart and interesting idea: to tell texts apart, we should not focus on the words that are most common everywhere. Words that are very frequent in one text are usually very frequent in the other texts too, so they do not help us find differences between texts. The rarer words are the informative ones.
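
To make the score concrete, here is a minimal hand-written sketch of the computation, assuming scikit-learn's default settings for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term t. Each row is then divided by its Euclidean norm. The function name `tfidf_rows` is hypothetical. The three inputs are the preprocessed strings printed by the small test cell later in this notebook, so the values can be compared with the matrix shown there.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalised rows; docs is a list of token lists."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each token
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [
    "hello moimem test email offer insid".split(),
    "urgent 1000 click".split(),
    "bonjour ceci est un email l gitim".split(),
]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
# "email" occurs in two documents, "hello" in only one, so "hello" scores higher
print(round(first["email"], 4), round(first["hello"], 4))  # 0.322 0.4234
```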

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer=TfidfVectorizer(preprocessor=preprocess_text_nltk,
                           tokenizer=lambda txt: txt.split())
X=vectorizer.fit_transform(texts)
y=labels



In [38]:
print("Shape:", X.shape)
print(vectorizer.get_feature_names_out())
print("Matrix:\n",X.toarray())
print("Vocab size:", len(vectorizer.vocabulary_))
print("Some features:\n", list(vectorizer.vocabulary_.keys())[:20])

Shape: (6552, 46416)
['0' '00' '000' ... 'zzzzason' 'zzzzcc' 'zzzzteana']
Matrix:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Vocab size: 46416
Some features:
 ['quot', 'niall', 'broin', 'instal', 'warm', 'standbi', 'disk', 'number', 'box', 'size', 'sometim', 'bigger', 'main', 'idea', 'everi', 'night', 'rsync', 'partit', 'case', 'disast']


In [39]:
# Small test to see how it works
texts_raw = [
    "Hello @moimemeéé, this is a test email! Offer inside.",
    "URGENT: You have won $1000. Click here!!!",
    "Bonjour, ceci est un email légitime."
]

vectorizer2 = TfidfVectorizer(
    preprocessor=preprocess_text_nltk,
    tokenizer=lambda txt: txt.split(),
    lowercase=False
)

## 1 Debug preprocess
for txt in texts_raw:
    print("->", preprocess_text_nltk(txt))

## 2 Fit transform
X_test2 = vectorizer2.fit_transform(texts_raw)
print("Shape:", X_test2.shape)
print("Vocab:", vectorizer2.get_feature_names_out())
print("Matrix:\n", X_test2.toarray())
print(vectorizer2.get_feature_names_out())
print(type(texts_raw))

-> hello moimem test email offer insid
-> urgent 1000 click
-> bonjour ceci est un email l gitim
Shape: (3, 15)
Vocab: ['1000' 'bonjour' 'ceci' 'click' 'email' 'est' 'gitim' 'hello' 'insid' 'l'
 'moimem' 'offer' 'test' 'un' 'urgent']
Matrix:
 [[0.         0.         0.         0.         0.32200242 0.
  0.         0.42339448 0.42339448 0.         0.42339448 0.42339448
  0.42339448 0.         0.        ]
 [0.57735027 0.         0.         0.57735027 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.57735027]
 [0.         0.38988801 0.38988801 0.         0.29651988 0.38988801
  0.38988801 0.         0.         0.38988801 0.         0.
  0.         0.38988801 0.        ]]
['1000' 'bonjour' 'ceci' 'click' 'email' 'est' 'gitim' 'hello' 'insid' 'l'
 'moimem' 'offer' 'test' 'un' 'urgent']
<class 'list'>


## Building of Models

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [41]:
#Splitting the data
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.8,
                                               random_state=42,#to keep the same result
                                               stratify=y)

#Building of the models and fitting
reg_log =LogisticRegression().fit(X_train,y_train)
reg_for =RandomForestClassifier().fit(X_train,y_train)
reg_grad=GradientBoostingClassifier().fit(X_train,y_train)
reg_nn  =KNeighborsClassifier(n_neighbors=5).fit(X_train,y_train)

#Prediction
y_pred_log =reg_log.predict(X_test) 
y_pred_for =reg_for.predict(X_test) 
y_pred_grad=reg_grad.predict(X_test)
y_pred_nn  =reg_nn.predict(X_test)

#Evaluation
def printMetrics(method_title, y_predict, y_true):
    print(method_title) 
    print("\t Accuracy score: ", accuracy_score(y_true, y_predict))
    print("\t Precision score: ", precision_score(y_true, y_predict))
    print("\t Recall score: ", recall_score(y_true, y_predict))
    print("\t F1 score : ", f1_score(y_true, y_predict))


printMetrics("Logistic Regression :",y_pred_log, y_test)
printMetrics("K-Nearest Neighbor :",y_pred_nn, y_test)
printMetrics("Random Forest :",y_pred_for, y_test)
printMetrics("Gradient Boosting :",y_pred_grad, y_test)

Logistic Regression :
	 Accuracy score:  0.9557589626239512
	 Precision score:  0.9709821428571429
	 Recall score:  0.90625
	 F1 score :  0.9375
K-Nearest Neighbor :
	 Accuracy score:  0.5316552250190694
	 Precision score:  0.43703007518796994
	 Recall score:  0.96875
	 F1 score :  0.6023316062176166
Random Forest :
	 Accuracy score:  0.9748283752860412
	 Precision score:  0.976545842217484
	 Recall score:  0.9541666666666667
	 F1 score :  0.9652265542676501
Gradient Boosting :
	 Accuracy score:  0.9687261632341724
	 Precision score:  0.9720430107526882
	 Recall score:  0.9416666666666667
	 F1 score :  0.9566137566137566


We can see that the logistic regression, random forest, and gradient boosting models all perform very well: their accuracy is above 0.95 and their F1 scores are above 0.93. The K-nearest neighbors model, however, is not appropriate for this task. It has the highest recall (about 0.97), so it catches almost all spam, but its precision is only about 0.44, which means that more than half of the messages it flags as spam are actually ham.
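
The sketch below shows how the four scores follow from the confusion counts, using the K-nearest neighbors result as the example. The counts are not printed anywhere in this notebook: they are back-calculated from the reported scores, assuming a test split of 1311 messages containing 480 spam, and should be confirmed with `sklearn.metrics.confusion_matrix`. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted spam that really is spam
    recall = tp / (tp + fn)      # share of real spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Inferred counts (assumption): 465 spam caught, 15 missed,
# 599 ham wrongly flagged as spam, 232 ham correctly kept.
accuracy, precision, recall, f1 = binary_metrics(tp=465, fp=599, fn=15, tn=232)
print(accuracy, precision, recall, f1)
```

Under these counts the model predicts "spam" for 1064 of the 1311 messages, which would explain how recall can be high while precision and accuracy stay low.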

## Test

Here we give our models some real spam and ham messages that I received and see what the result is.

In [65]:
from email import policy
from email.parser import BytesParser
from bs4 import BeautifulSoup
from pathlib import Path

def load_messages(root_folder="publiccorpus"):
    paths = glob.glob(f"{root_folder}/**/*", recursive=True)
    texts, labels = [], []

    print(paths[0])
    for path in paths:
        # Skip directories (open() raises IsADirectoryError on them)
        try:
            with open(path, 'rb') as f:
                msg = BytesParser(policy=policy.default).parse(f)
        except IsADirectoryError:
            continue

        # Extract the body (plain text preferred, HTML as fallback)
        body = msg.get_body(preferencelist=('plain', 'html'))
        raw_text = ""
        if body:
            print("body")
            try:
                # Standard method (may raise LookupError for an unknown charset)
                raw_text = body.get_content()
            except LookupError:
                # Fallback: decode the payload manually
                payload = body.get_payload(decode=True) or b""
                try:
                    raw_text = payload.decode('utf-8', errors='replace')
                except (UnicodeDecodeError, LookupError):
                    raw_text = payload.decode('latin-1', errors='replace')
        
        if not body and not msg.is_multipart():
            print("no body")
            try:
                raw_text=msg.get_content().strip()
            except Exception:
                continue

        texts.append(raw_text)
        labels.append(0 if "ham" in path.lower() else 1)

    return texts, labels

text_test,labels_test =load_messages("/home/christian/ProjetsPerso/Artificial_Intelligence/MachineLearning/Spam_Detection/test/")

print(labels_test)

/home/christian/ProjetsPerso/Artificial_Intelligence/MachineLearning/Spam_Detection/test/test
no body
no body
no body
no body
[0, 0, 1, 1]


In [66]:
print(text_test)

['Hello Sir,\n\nI am writing this email on behalf of the whole group to thank you for the time you gave us last Saturday.\n\n\nThis initial interview was extremely useful, both in terms of our choice of specialization in our second year and the important, up-to-date information you provided about the data scientist profession. It also helped us gain a clearer picture of our entire journey at Ensimag to achieve our goals.\n\n\nWe hope to see you again one day. Thank you, and see you soon.\n\n\nBest regards.', 'Hello,\n\nPlease send me the documents listed below as a matter of urgency. Some documents are attached and need to be completed.\n\nPlease send them by email to prepa-ginp@lycee-blaisepascal.com or to the office by Monday, September 20 at the latest:\n\n    A copy of the civil status document for the student and their parents (ID card or passport)\n    A certificate of nationality for the student,\n    The medical file completed by a doctor,\n    A copy of the vaccination record,

In [71]:
X_test=vectorizer.transform(text_test)
y_test=labels_test

#Predictions
y_pred_log =reg_log.predict(X_test) 
y_pred_for =reg_for.predict(X_test) 
y_pred_grad=reg_grad.predict(X_test)
y_pred_nn  =reg_nn.predict(X_test)

#Evaluations
def printMetrics(method_title, y_predict, y_true):
    print(method_title) 
    print("\t Accuracy score: ", accuracy_score(y_true, y_predict))
    print("\t Precision score: ", precision_score(y_true, y_predict))
    print("\t Recall score: ", recall_score(y_true, y_predict))
    print("\t F1 score : ", f1_score(y_true, y_predict))

    for i in range(len(y_predict)):
        print("\t True value:",y_true[i] ,"Prediction:",y_predict[i])

printMetrics("Logistic Regression :",y_pred_log, y_test)
printMetrics("K-Nearest Neighbor :",y_pred_nn, y_test)
printMetrics("Random Forest :",y_pred_for, y_test)
printMetrics("Gradient Boosting :",y_pred_grad, y_test)

Logistic Regression :
	 Accuracy score:  0.75
	 Precision score:  1.0
	 Recall score:  0.5
	 F1 score :  0.6666666666666666
	 True value: 0 Prediction: 0
	 True value: 0 Prediction: 0
	 True value: 1 Prediction: 1
	 True value: 1 Prediction: 0
K-Nearest Neighbor :
	 Accuracy score:  0.5
	 Precision score:  0.5
	 Recall score:  1.0
	 F1 score :  0.6666666666666666
	 True value: 0 Prediction: 1
	 True value: 0 Prediction: 1
	 True value: 1 Prediction: 1
	 True value: 1 Prediction: 1
Random Forest :
	 Accuracy score:  0.75
	 Precision score:  1.0
	 Recall score:  0.5
	 F1 score :  0.6666666666666666
	 True value: 0 Prediction: 0
	 True value: 0 Prediction: 0
	 True value: 1 Prediction: 1
	 True value: 1 Prediction: 0
Gradient Boosting :
	 Accuracy score:  0.75
	 Precision score:  1.0
	 Recall score:  0.5
	 F1 score :  0.6666666666666666
	 True value: 0 Prediction: 0
	 True value: 0 Prediction: 0
	 True value: 1 Prediction: 1
	 True value: 1 Prediction: 0


#### Comments:
For the moment, the number of messages in the test set (four) is too low to draw a meaningful interpretation, so we will come back to this later.
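
To give a rough idea of how little four messages can tell us, the sketch below computes a 95% Wilson score interval for an accuracy of 3 correct out of 4. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for illustration only. It is one common way to put a range around a proportion measured on a small sample.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 corresponds to about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

low, high = wilson_interval(3, 4)
print(f"accuracy 0.75, plausible range roughly {low:.2f} to {high:.2f}")
```

The range is very wide, so an accuracy of 0.75 on four messages cannot be used to rank the models.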

# Conclusion

This project was very useful because it helped me learn how to clean text and make it understandable for a computer science task. I then built models, as in my other projects, to detect spam and ham emails.