# Email Spam Detection System Using Different Machine Learning Models

#### Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

#### Loading the Dataset

In [2]:
data = pd.read_csv('./spam.csv')
data

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will �_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
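The replacement character (`�`) in row 5568 suggests the file may not be UTF-8 encoded. As a hedged sketch, assuming a Latin-1 encoded CSV (an assumption about the file, not something this notebook confirms), `read_csv` accepts an explicit `encoding`, and `usecols` keeps only the two populated columns. The temporary file and its contents below are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Write a tiny Latin-1 encoded CSV (hypothetical content, not the real dataset).
raw = "v1,v2,Unnamed: 2\nham,caf\u00e9 at 5?,\nspam,WIN a prize now,\n".encode("latin-1")
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(raw)
    csv_path = handle.name

# An explicit encoding avoids decode errors on non-UTF-8 bytes;
# usecols skips the mostly empty trailing columns at load time.
sample = pd.read_csv(csv_path, encoding="latin-1", usecols=["v1", "v2"])
os.remove(csv_path)

print(sample)
```

If the real file is in a different encoding, the same call applies with that encoding name substituted.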


#### Explore the data

In [3]:
data.shape

(5572, 5)

In [4]:
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
data.sample(20)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
105,ham,Thanks a lot for your wishes on my birthday. T...,,,
1993,ham,Eh den sat u book e kb liao huh...,,,
4975,ham,You are gorgeous! keep those pix cumming :) th...,,,
5485,ham,Also fuck you and your family for going to rho...,,,
2811,ham,"Say this slowly.? GOD,I LOVE YOU &amp; I NEED ...",,,
3565,ham,Its ok..come to my home it vl nice to meet and...,,,
2126,ham,You do got a shitload of diamonds though,,,
229,ham,Dear good morning now only i am up,,,
1938,ham,Excellent! Are you ready to moan and scream in...,,,
3622,ham,"Damn, poor zac doesn't stand a chance",,,


In [6]:
data.tail()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will �_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [8]:
data.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


#### Checking the null percentage

In [9]:
data.isnull().mean()*100

v1             0.000000
v2             0.000000
Unnamed: 2    99.102656
Unnamed: 3    99.784637
Unnamed: 4    99.892319
dtype: float64
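Since the three `Unnamed` columns are more than 99% empty, one option is to drop them before going further. The sketch below shows the idea on a small made-up frame; the 0.5 threshold and the new column names `label` and `message` are illustrative assumptions, not part of this notebook.

```python
import pandas as pd

# Toy frame mimicking the layout above (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["see you at 5", "WIN a free prize now", "ok", "call me later"],
    "Unnamed: 2": [None, None, "stray text", None],
    "Unnamed: 3": [None, None, None, None],
})

# Share of missing values per column, as a fraction between 0 and 1.
null_share = toy.isnull().mean()

# Drop every column whose missing share exceeds the (assumed) threshold.
threshold = 0.5
sparse_cols = null_share[null_share > threshold].index
cleaned = toy.drop(columns=sparse_cols)

# Optional: give the remaining columns descriptive names.
cleaned = cleaned.rename(columns={"v1": "label", "v2": "message"})
print(cleaned.columns.tolist())
```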

#### Checking the duplicate values

In [10]:
data.duplicated().sum()

403

#### Dropping the duplicate values

In [11]:
data.drop_duplicates(inplace=True)
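Note that `duplicated()` and `drop_duplicates()` compare whole rows by default, including the sparse `Unnamed` columns. A hedged variation, shown on made-up rows, restricts the comparison to the label and message text with `subset` and resets the index afterwards so it has no gaps.

```python
import pandas as pd

# Hypothetical rows: the same message appears three times, once with stray text
# in an extra column.
toy = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham"],
    "v2": ["ok", "ok", "WIN cash", "ok"],
    "Unnamed: 2": [None, "x", None, None],
})

# Whole-row comparison versus comparison on label and text only.
full_row_dupes = toy.duplicated().sum()
text_dupes = toy.duplicated(subset=["v1", "v2"]).sum()

# Keep the first occurrence of each (label, text) pair and renumber the rows.
deduped = toy.drop_duplicates(subset=["v1", "v2"]).reset_index(drop=True)

print(full_row_dupes, text_dupes, len(deduped))
```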

#### Separating features and labels

In [12]:
x = data['v2']  # message text (feature)
y = data['v1']  # class label: 'ham' or 'spam'

### Data Preprocessing

#### Splitting the dataset into training and testing sets.

In [13]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
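Because ham heavily outnumbers spam in this dataset, a plain random split can leave the two sets with slightly different class ratios. As an optional sketch, `train_test_split` accepts a `stratify` argument that preserves the ratio; the messages and the 15/85 class balance below are invented for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 spam and 85 ham messages.
messages = [f"msg {i}" for i in range(100)]
labels = ["spam"] * 15 + ["ham"] * 85

# stratify=labels keeps the spam/ham ratio the same in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```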

#### Feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF)

In [14]:
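# Keep the 5,000 most frequent terms. Fit on the training text only,
# then reuse the learned vocabulary to transform the test text.
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
x_train_tfidf = tfidf_vectorizer.fit_transform(x_train)
x_test_tfidf = tfidf_vectorizer.transform(x_test)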
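The fit-on-train, transform-on-test pattern can also be bundled with a classifier in a scikit-learn pipeline, so that raw text goes in and predictions come out. This is a minimal sketch on a made-up four-message corpus, not the setup used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages and labels.
train_texts = [
    "win free prize now",
    "free cash call now",
    "see you at lunch",
    "are you home yet",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together on the
# training text, and applies the same vocabulary at prediction time.
pipe = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
pipe.fit(train_texts, train_labels)

pred = pipe.predict(["free prize call now"])
print(pred)
```

A pipeline also makes it harder to accidentally call `fit_transform` on the test set.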


### Model selection and training

#### Naive Bayes

In [15]:
naivebayes_model = MultinomialNB()
naivebayes_model.fit(x_train_tfidf, y_train)

#### Support Vector Machine

In [16]:
svm_model = SVC(kernel='linear')
svm_model.fit(x_train_tfidf, y_train)

#### Random Forest

In [17]:
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(x_train_tfidf, y_train)
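Each model above is fitted once on a single split. As an optional, hedged complement, k-fold cross-validation gives a less split-dependent comparison. The sketch below runs the same three model types on a tiny invented corpus; the texts, the 3-fold setting, and the default vectorizer settings are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus: six spam-like and six ham-like messages.
spam_texts = [
    "win a free prize now", "free cash call now", "claim your free reward",
    "urgent prize waiting call", "win cash today free entry",
    "free entry claim prize now",
]
ham_texts = [
    "see you at lunch", "are you home yet", "call me when you land",
    "running late see you soon", "thanks for the lift home",
    "lunch at noon tomorrow",
]
texts = spam_texts + ham_texts
labels = ["spam"] * len(spam_texts) + ["ham"] * len(ham_texts)

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Wrapping the vectorizer in the pipeline refits it inside every fold,
# so no fold's held-out text influences the vocabulary.
mean_scores = {}
for name, model in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {mean_scores[name]:.2f}")
```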

### Model evaluation

In [18]:
def evaluate_model(model, x_test, y_test):
    """Return the accuracy and the per-class classification report on the test set."""
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report

naivebayes_accuracy, naivebayes_report = evaluate_model(naivebayes_model, x_test_tfidf, y_test)
svm_accuracy, svm_report = evaluate_model(svm_model, x_test_tfidf, y_test)
rf_accuracy, rf_report = evaluate_model(random_forest, x_test_tfidf, y_test)

print("Naive Bayes Accuracy:", naivebayes_accuracy)
print(naivebayes_report)

print("\nSVM Accuracy:", svm_accuracy)
print(svm_report)

print("\nRandom Forest Accuracy:", rf_accuracy)
print(rf_report)


Naive Bayes Accuracy: 0.965183752417795
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       889
        spam       1.00      0.75      0.86       145

    accuracy                           0.97      1034
   macro avg       0.98      0.88      0.92      1034
weighted avg       0.97      0.97      0.96      1034


SVM Accuracy: 0.9864603481624759
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       889
        spam       0.98      0.92      0.95       145

    accuracy                           0.99      1034
   macro avg       0.98      0.96      0.97      1034
weighted avg       0.99      0.99      0.99      1034


Random Forest Accuracy: 0.9796905222437138
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       889
        spam       0.99      0.86      0.92       145

    accuracy                           0.98      1034
   macro a

### Conclusion

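Based on the evaluation metrics, the Support Vector Machine (SVM) is the best-performing model for spam detection on this test split. It reaches the highest accuracy (0.986, against 0.980 for Random Forest and 0.965 for Naive Bayes) and the best balance on the spam class, with precision 0.98 and recall 0.92 (F1 0.95). Naive Bayes and Random Forest are about as precise on spam (1.00 and 0.99) but miss more of it, with spam recall of 0.75 and 0.86 respectively. The SVM therefore catches most spam while producing the fewest false negatives of the three, although a recall of 0.92 means roughly 8% of the spam in the test set is still classified as ham.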
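False negatives (spam classified as ham) can be counted directly from a confusion matrix rather than read off the rounded recall figure. The sketch below uses invented labels and predictions to show the mechanics; on the real data, `y_test` and a model's predictions would take their place.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical ground truth and predictions: one of four spam messages is missed.
y_true = ["ham"] * 6 + ["spam"] * 4
y_hat = ["ham"] * 6 + ["spam", "spam", "spam", "ham"]

# With labels=["ham", "spam"], rows are true classes and columns are predictions.
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

# Spam recall is the share of true spam that was caught: tp / (tp + fn).
spam_recall = recall_score(y_true, y_hat, pos_label="spam")

print(cm)
print("false negatives:", fn, "spam recall:", spam_recall)
```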