Spam Mail Detector

Installing dependencies

In [1]:
!pip install numpy pandas matplotlib seaborn scikit-learn nltk jupyter ipykernel

Collecting numpy
  Downloading numpy-2.3.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting pandas
  Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting matplotlib
  Downloading matplotlib-3.10.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting jupyter
  Downloading jupyter-1.1.1-p

Step 1: Loading dataset

In [2]:
import pandas as pd

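# The file is tab-separated with no header row, so the column names are given explicitly.
data = pd.read_csv("SMSSpamCollection.csv", sep="\t", names=["label", "message"])

# Encode the target as integers: ham -> 0, spam -> 1.
data['label_num'] = data["label"].map({'ham': 0, 'spam': 1})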

print(data.head())

  label                                            message  label_num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...          0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0


Step 2: Preprocessing the text

In [3]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def pre_text(text):
    text = text.lower()  # normalize case
    text = re.sub(r'[^a-z\s]', '', text)  # drop digits, punctuation and other non-letter characters
    tokens = [word for word in text.split() if word not in stop_words]  # remove English stop words
    return " ".join(tokens)

data['clean_data']=data['message'].apply(pre_text)

print(data[['message','clean_data']].head())

                                             message  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                          clean_data  
0  go jurong point crazy available bugis n great ...  
1                            ok lar joking wif u oni  
2  free entry wkly comp win fa cup final tkts st ...  
3                u dun say early hor u c already say  
4        nah dont think goes usf lives around though  


[nltk_data] Downloading package stopwords to /home/bat/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
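
To make the effect of each cleaning step easier to see, here is a minimal sketch of the same logic on single strings. `pre_text_demo` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set is an assumption used only so the example runs without downloading the NLTK list.

```python
import re

# Hypothetical, deliberately tiny stop-word set (assumption for illustration);
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"i", "he", "to", "in", "a", "the"}

def pre_text_demo(text, stop_words=DEMO_STOP_WORDS):
    text = text.lower()                    # "Free" -> "free"
    text = re.sub(r"[^a-z\s]", "", text)   # "2" and "!" disappear, "don't" -> "dont"
    return " ".join(w for w in text.split() if w not in stop_words)

print(pre_text_demo("Free entry in 2 a wkly comp!"))
print(pre_text_demo("I don't think he goes to usf"))
```

Note that removing the apostrophe turns "don't" into "dont", which no longer matches a stop-word entry; this is why "dont" survives in row 4 of the output above.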


Step 3: Converting text into numeric features

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect=TfidfVectorizer()
X=vect.fit_transform(data['clean_data'])
y=data['label_num']

print("Feature matrix structure ->",X.shape)

Feature matrix structure -> (5572, 8480)
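
As a rough illustration of what each of the 8480 columns holds, the sketch below computes TF-IDF weights by hand using the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 row normalization that scikit-learn documents as the TfidfVectorizer defaults. It is an approximation, not the library's implementation: `tfidf_rows` is a hypothetical helper, the three documents are made up, and it splits on whitespace, whereas the default vectorizer also drops one-character tokens such as "u" and "c".

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed, L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Made-up documents standing in for data['clean_data'].
demo_docs = ["free entry win", "ok lar", "free win win"]
rows = tfidf_rows(demo_docs)
for doc, row in zip(demo_docs, rows):
    print(doc, "->", {term: round(w, 3) for term, w in row.items()})
```

In the first document, "entry" appears in only one document and so receives a larger weight than "free", which appears in two.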


Step 4: Splitting into train and test sets

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=45)

print("Train data size:",X_train.shape)
print("Test data size:",X_test.shape)

Train data size: (3900, 8480)
Test data size: (1672, 8480)
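
Spam is the minority class here, so a purely random split can leave the two sets with slightly different spam shares. One option, shown as a sketch on made-up labels rather than the notebook's data, is to pass `stratify` to `train_test_split` so that both sets keep the same class ratio. The arrays `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 ham (0) and 20 spam (1).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=45, stratify=y_demo
)

print("Spam share in train:", y_tr.mean())
print("Spam share in test:", y_te.mean())
```

Applied to the notebook, this would mean adding `stratify=y` to the existing call; the reported sizes would stay the same, but the scores in Step 6 could shift slightly.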


Step 5: Training the models

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nb=MultinomialNB()
nb.fit(X_train,y_train)
ypredn=nb.predict(X_test)

lr=LogisticRegression(max_iter=200)
lr.fit(X_train,y_train)
ypredl=lr.predict(X_test)
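
In Step 3 the vectorizer is fitted on all 5572 messages before the split, so the vocabulary and IDF weights are partly learned from messages that later end up in the test set. A possible alternative, sketched here on a made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training portion only. The `texts` and `labels` lists are hypothetical stand-ins for `data['clean_data']` and `data['label_num']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (assumption for illustration only).
texts = [
    "free entry win cash prize", "win free tickets now",
    "claim free prize today", "urgent cash prize claim now",
    "ok see you later", "call me when you arrive", "lunch at noon today",
    "are you coming home", "thanks for the help", "see you at home later",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first, so the test messages stay unseen by the vectorizer.
texts_train, texts_test, y_train_demo, y_test_demo = train_test_split(
    texts, labels, test_size=0.3, random_state=45, stratify=labels
)

# fit() learns the vocabulary, IDF weights and classifier from the training texts only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts_train, y_train_demo)

preds = model.predict(texts_test)
print("Predictions:", preds)
print("Vocabulary size:", len(model.named_steps["tfidfvectorizer"].vocabulary_))
```

Words that occur only in the test messages are simply ignored at prediction time, which is closer to how the model would behave on new mail.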

Step 6: Model performance

In [14]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,classification_report


print("Naive Bayes:")
print("\nClassification report:\n ",classification_report(y_test,ypredn))
print("\nAccuracy: ",accuracy_score(y_test,ypredn))
print("Precision: ",precision_score(y_test,ypredn))
print("Recall: ",recall_score(y_test,ypredn))
print("f1_score: ",f1_score(y_test,ypredn))


print("\n\n\n\nLogistic Regression:")
print("\nClassification report:\n ",classification_report(y_test,ypredl))
print("\nAccuracy: ",accuracy_score(y_test,ypredl))
print("Precision: ",precision_score(y_test,ypredl))
print("Recall: ",recall_score(y_test,ypredl))
print("f1_score: ",f1_score(y_test,ypredl))



Naive Bayes:

Classification report:
                precision    recall  f1-score   support

           0       0.95      1.00      0.97      1448
           1       1.00      0.64      0.78       224

    accuracy                           0.95      1672
   macro avg       0.97      0.82      0.88      1672
weighted avg       0.95      0.95      0.95      1672


Accuracy:  0.9521531100478469
Precision:  1.0
Recall:  0.6428571428571429
f1_score:  0.782608695652174




Logistic Regression:

Classification report:
                precision    recall  f1-score   support

           0       0.94      1.00      0.97      1448
           1       0.98      0.58      0.73       224

    accuracy                           0.94      1672
   macro avg       0.96      0.79      0.85      1672
weighted avg       0.95      0.94      0.94      1672


Accuracy:  0.9425837320574163
Precision:  0.9848484848484849
Recall:  0.5803571428571429
f1_score:  0.7303370786516854
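
The scores above can be traced back to confusion-matrix counts. The counts in this sketch are not printed by the notebook; they are inferred from the reported recall, precision and supports (1448 ham, 224 spam), so treat them as an assumption. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1

# Counts inferred from the printed scores (assumption, not notebook output).
nb_scores = binary_metrics(tp=144, fp=0, fn=80, tn=1448)   # Naive Bayes
lr_scores = binary_metrics(tp=130, fp=2, fn=94, tn=1446)   # Logistic Regression

for name, scores in [("Naive Bayes", nb_scores), ("Logistic Regression", lr_scores)]:
    print(name, [round(s, 4) for s in scores])
```

Read this way, Naive Bayes never flags a legitimate message as spam but lets 80 of the 224 spam messages through, and Logistic Regression misses 94. Accuracy stays near 0.95 for both because ham makes up most of the test set, so recall on the spam class is the more informative number here.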
