Loading and Exploring the Data

In [1]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/DATASETS/Spam /spam.csv", encoding='latin-1')

df.sample(5)

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
2840,ham,Ok thanx...,,,
4935,ham,K..k.:)congratulation ..,,,
5564,ham,Why don't you wait 'til at least wednesday to ...,,,
5278,spam,URGENT! Your Mobile number has been awarded wi...,,,
3999,spam,This is the 2nd time we have tried to contact ...,,,


In [2]:
# Keep only the first two columns (v1, v2); .copy() makes df an independent
# DataFrame, so later column assignments do not raise SettingWithCopyWarning.
df = df.iloc[:, :2].copy()
df.columns = ['label', 'message']

In [3]:
df['label'] = df['label'].map({'spam': 1, 'ham': 0})


In [4]:
df.isnull().sum()

,0
label,0
message,0
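
The null check above also guards the label mapping from cell 3: Series.map with a dict returns NaN for any value that is missing from the dict. A minimal sketch on made-up labels (the toy values and the names raw, mapped and n_unmapped are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy labels; 'Spam' (capitalised) is deliberately absent from the mapping.
raw = pd.Series(['spam', 'ham', 'Spam'])

mapped = raw.map({'spam': 1, 'ham': 0})

# Unmapped values become NaN, so a null count exposes them.
n_unmapped = int(mapped.isnull().sum())
print(n_unmapped)
```

A count of zero, as in the output above, suggests every label in the file was either 'spam' or 'ham'.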


In [5]:
import re

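def preprocess_text(text):
    """Lowercase a message and reduce it to space-separated word characters."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)   # replace punctuation and other non-word characters with a space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into a single space
    return text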

df['message'] = df['message'].apply(preprocess_text)



In [6]:
df.head()

,label,message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor u c already then say
4,0,nah i don t think he goes to usf he lives arou...
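
To see what preprocess_text does to a single message, it can be applied to a raw string. The sketch below repeats the function from cell 5 so that it runs on its own; the sample sentences are made up for illustration.

```python
import re

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)   # non-word characters become spaces
    text = re.sub(r'\s+', ' ', text)  # whitespace runs collapse to one space
    return text

sample = "FREE entry!!  Text WIN to 80086"
cleaned = preprocess_text(sample)
print(cleaned)

# Apostrophes are non-word characters too, so contractions get split.
contraction = preprocess_text("I don't")
print(contraction)
```

The second call shows why row 4 of the output above reads "don t": the apostrophe is replaced by a space, which turns one token into two.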


Feature Extraction (TF-IDF)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

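# Keep only the 3000 most frequent terms across the corpus as features.
tfidf = TfidfVectorizer(max_features=3000)

# fit_transform returns a sparse matrix; .toarray() converts it to a dense array.
X = tfidf.fit_transform(df['message']).toarray()
y = df['label'].values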


Model Selection and Training

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
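
Cell 7 fits the TF-IDF vocabulary and IDF weights on all messages before this split, so statistics from the test messages influence the features. One common alternative, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a Pipeline, so the vectorizer is fitted on the training portion only. The stratify argument, which keeps the spam/ham ratio the same in both parts, is an addition that the original notebook does not use; the names texts, labels and pipe are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data: 5 spam-like and 5 ham-like messages (not from the dataset).
texts = [
    "win a free prize now", "free entry claim cash",
    "urgent call to claim prize", "cash prize waiting call now",
    "free ringtone text win", "ok see you later",
    "are you coming home", "i will call you tonight",
    "lunch at noon today", "thanks for the lift",
]
labels = [1] * 5 + [0] * 5

# Split the raw text, not the TF-IDF matrix.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on text_train only, then trains the classifier.
pipe = make_pipeline(TfidfVectorizer(max_features=3000), MultinomialNB())
pipe.fit(text_train, y_train)

y_pred = pipe.predict(text_test)
print(len(text_train), len(text_test))
```

With this layout, pipe.predict applies the training-set vocabulary to unseen text, which is closer to how the model would be used on new messages.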

Naive Bayes

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

y_pred = nb_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9757847533632287
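
Accuracy alone does not show which kind of error a spam filter makes. A minimal sketch, using hand-made labels rather than the notebook's predictions, of how confusion_matrix, precision_score and recall_score separate ham flagged as spam (false positives) from spam that slips through (false negatives):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made example: 1 = spam, 0 = ham (not the notebook's data).
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = precision_score(y_true_toy, y_pred_toy)  # tp / (tp + fp)
recall = recall_score(y_true_toy, y_pred_toy)        # tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)
```

In the notebook itself, the same calls would take y_test and the predictions of each model in place of the toy lists.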


Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_lr))

Accuracy: 0.9641255605381166


Support Vector Machine (SVM)

In [12]:
from sklearn.svm import SVC

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_svm))

Accuracy: 0.9820627802690582


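Conclusion: On this single 80/20 split, the linear SVM achieves the highest test accuracy (about 98.2%), followed by Naive Bayes (about 97.6%) and Logistic Regression (about 96.4%). The gaps are small, roughly 0.6 to 1.8 percentage points, so the ranking may not hold on a different split.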
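
Because the three accuracies come from a single 80/20 split, differences of about one percentage point may not be stable. One way to check, sketched below on a small made-up corpus rather than the real data, is stratified k-fold cross-validation with cross_val_score, which returns one score per fold for each model. The corpus, the fold count and the names models, cv and scores are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data: 5 spam-like and 5 ham-like messages (not from the dataset).
texts = [
    "win a free prize now", "free entry claim cash",
    "urgent call to claim prize", "cash prize waiting call now",
    "free ringtone text win", "ok see you later",
    "are you coming home", "i will call you tonight",
    "lunch at noon today", "thanks for the lift",
]
labels = [1] * 5 + [0] * 5

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": SVC(kernel="linear"),
}

# Each of the 5 folds holds out one spam and one ham message.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, clf in models.items():
    # TF-IDF sits inside the pipeline, so it is refitted on each training fold.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
    print(f"{name}: mean={scores[name].mean():.3f} std={scores[name].std():.3f}")
```

Comparing the per-fold spread with the gap between the means gives a rough sense of whether one model is consistently ahead.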