<a href="https://colab.research.google.com/github/Shamil2007/Machine-Learning/blob/main/spam_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [19]:
!kaggle datasets download zeeshanyounas001/email-spam-detection

Dataset URL: https://www.kaggle.com/datasets/zeeshanyounas001/email-spam-detection
License(s): apache-2.0
email-spam-detection.zip: Skipping, found more recently modified local copy (use --force to force download)


In [20]:
!unzip /content/email-spam-detection.zip

Archive:  /content/email-spam-detection.zip
replace spam mail.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: spam mail.csv           
replace spam.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: spam.csv                


In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [22]:
df = pd.read_csv("/content/spam.csv", encoding='latin-1')
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [23]:
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [24]:
# Encode the label as 1 for spam and 0 for ham, then drop the raw label
# column and give the text column a descriptive name.
df['isSpam'] = (df['v1'] == 'spam').astype('int64')
df = df.drop(columns='v1').rename(columns={'v2': 'Message'})
df

Unnamed: 0,Message,isSpam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
...,...,...
5567,This is the 2nd time we have tried 2 contact u...,1
5568,Will Ì_ b going to esplanade fr home?,0
5569,"Pity, * was in mood for that. So...any other s...",0
5570,The guy did some bitching but I acted like i'd...,0


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Message  5572 non-null   object
 1   isSpam   5572 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


In [26]:
df.isSpam.value_counts()

Unnamed: 0_level_0,count
isSpam,Unnamed: 1_level_1
0,4825
1,747


In [27]:
df.describe()

Unnamed: 0,isSpam
count,5572.0
mean,0.134063
std,0.340751
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [28]:
df.Message.value_counts()

Unnamed: 0_level_0,count
Message,Unnamed: 1_level_1
"Sorry, I'll call later",30
I cant pick the phone right now. Pls send a message,12
Ok...,10
"7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st \Ur Lovely Friendship\""... good morning dear""",4
"Say this slowly.? GOD,I LOVE YOU &amp; I NEED YOU,CLEAN MY HEART WITH YOUR BLOOD.Send this to Ten special people &amp; u c miracle tomorrow, do it,pls,pls do it...",4
...,...
I gotta collect da car at 6 lei.,1
No. On the way home. So if not for the long dry spell the season would have been over,1
Urgent! Please call 09061743811 from landline. Your ABTA complimentary 4* Tenerife Holiday or å£5000 cash await collection SAE T&Cs Box 326 CW25WX 150ppm,1
Dear 0776xxxxxxx U've been invited to XCHAT. This is our final attempt to contact u! Txt CHAT to 86688 150p/MsgrcvdHG/Suite342/2Lands/Row/W1J6HL LDN 18yrs,1


In [29]:
df.drop_duplicates(inplace=True)
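
The value counts above show that some messages repeat many times. As a minimal sketch of how to check the effect of de-duplication, the snippet below counts duplicated rows on a small stand-in frame; `toy` and its contents are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame (same column names).
toy = pd.DataFrame({
    "Message": ["Ok...", "Ok...", "Sorry, I'll call later", "WIN cash now"],
    "isSpam": [0, 0, 0, 1],
})

# duplicated() flags every repeat of an earlier row (all columns compared).
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; resetting the index keeps
# the row labels contiguous afterwards.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicates: {n_dupes}, rows before: {len(toy)}, rows after: {len(deduped)}")
```

Running the same two calls on `df` before cell 29 would report how many rows the de-duplication step removes.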

In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Message, df.isSpam, random_state=42, test_size=0.2)
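
The classes are imbalanced (4825 ham vs. 747 spam before de-duplication), so a plain random split can leave the test set with a slightly different spam ratio than the training set. One possible variation, not what the notebook does, is to pass `stratify` to `train_test_split`. The sketch below uses made-up toy labels to show the effect.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 80 ham (0) and 20 spam (1) messages.
toy_messages = [f"message {i}" for i in range(100)]
toy_labels = [0] * 80 + [1] * 20

# stratify=toy_labels keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_messages, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(f"spam share in train: {sum(y_tr) / len(y_tr):.2f}")
print(f"spam share in test:  {sum(y_te) / len(y_te):.2f}")
```

In the notebook this would correspond to adding `stratify=df.isSpam` to the existing call; the reported scores would then differ slightly from the ones shown here.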

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [32]:
nb_pipeline = Pipeline([
    ('vector', CountVectorizer()),
    ('nb', MultinomialNB())
])

nb_pipeline.fit(X_train, y_train)
print(f"Train score: {nb_pipeline.score(X_train, y_train)}, Test score: {nb_pipeline.score(X_test, y_test)}")

Train score: 0.992261185006046, Test score: 0.9854932301740812
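
Because the vectorizer lives inside the pipeline, a fitted pipeline can classify raw strings directly. The following is a small self-contained sketch of that usage; the training messages, labels, and the name `demo_pipeline` are invented for illustration and stand in for `nb_pipeline` and the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set (1 = spam, 0 = ham).
train_messages = [
    "win a free prize now",
    "free cash entry win",
    "urgent claim your free prize",
    "are we meeting for lunch",
    "call me after the meeting",
    "see you at home later",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Same two steps as the notebook's pipeline.
demo_pipeline = Pipeline([
    ('vector', CountVectorizer()),
    ('nb', MultinomialNB()),
])
demo_pipeline.fit(train_messages, train_labels)

# Raw text goes in; the pipeline vectorizes it before classifying.
new_messages = ["claim your free cash prize", "lunch after the meeting"]
predictions = demo_pipeline.predict(new_messages)

# Column 1 of predict_proba is the estimated probability of class 1 (spam).
spam_probabilities = demo_pipeline.predict_proba(new_messages)[:, 1]

for text, label, prob in zip(new_messages, predictions, spam_probabilities):
    print(f"{text!r}: isSpam={label}, P(spam)={prob:.3f}")
```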


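🔹 CountVectorizer: Tokenizes each message and converts the corpus into a sparse document-term matrix, where each row is a message and each column counts how often one vocabulary word occurs in it.

🔹 MultinomialNB: A Naive Bayes classifier that models word counts with a per-class multinomial distribution. It is fast to train, works well on sparse count features, and is commonly used for spam detection, sentiment analysis, and topic classification.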

In [33]:
y_pred = nb_pipeline.predict(X_test)

In [34]:
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred, average='weighted')
f1

0.9852518532141163
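
With `average='weighted'`, each class's F1 is weighted by its support, so the majority ham class dominates the score. A possible complement, shown here as a sketch on made-up labels rather than the notebook's `y_test` and `y_pred`, is to also report the F1 of the spam class alone and a per-class breakdown.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical imbalanced example: 8 ham (0), 2 spam (1); one spam is missed.
y_true = [0] * 8 + [1] * 2
y_hat = [0] * 8 + [1, 0]

# Support-weighted average over both classes.
weighted_f1 = f1_score(y_true, y_hat, average='weighted')

# F1 of the positive (spam) class only; average='binary' is the default.
spam_f1 = f1_score(y_true, y_hat, pos_label=1)

print(f"weighted F1: {weighted_f1:.4f}")
print(f"spam-only F1: {spam_f1:.4f}")
print(classification_report(y_true, y_hat, target_names=['ham', 'spam']))
```

In this toy case half of the spam is missed, yet the weighted score stays well above the spam-only score, which is why looking at both can be informative on imbalanced data.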