## Introduction

Short Message Service, commonly abbreviated as SMS, is a text messaging service component of most telephone, Internet and mobile device systems. It uses standardized communication protocols that let mobile phones exchange short text messages.

Spam refers to unsolicited and unwanted messages. Spam texts usually do not come from another phone; most originate from a computer and reach your phone through an email address or instant messaging account.

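An SMS spam detection model takes an SMS message as input and predicts whether it is spam or not spam (ham). This notebook trains a Multinomial Naive Bayes classifier on two document-term representations, raw counts and TF-IDF weights, and reports the results for each.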
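
The input/output contract described above can be sketched on a handful of messages. This is only a minimal illustration: the toy messages and labels are invented, and wrapping the vectorizer and classifier in `make_pipeline` is an assumption made for brevity, not something the notebook itself does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, invented for illustration only.
toy_messages = [
    "Win a free prize now, call to claim",
    "Free entry to win cash, text now",
    "Are we still meeting for lunch today",
    "Ok, see you at home later",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# Raw text goes in, a ham/spam label comes out.
spam_detector = make_pipeline(CountVectorizer(), MultinomialNB())
spam_detector.fit(toy_messages, toy_labels)

predictions = spam_detector.predict(
    ["free cash prize, call now", "see you at lunch"]
)
print(predictions)
```

With so few training messages the result says nothing about real-world quality; the point is only the shape of the interface.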

In [10]:
# Required Libraries 

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

In [3]:
# Load the tab-separated file; `names` labels the two columns (class label, message text)
df = pd.read_table('sms.tsv', names=['Spam', 'Mail'])
df

,Spam,Mail
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [4]:
x = df['Mail']  # feature: raw message text
y = df['Spam']  # target: 'ham' or 'spam' label

In [12]:
y.value_counts()

Spam
ham     4825
spam     747
Name: count, dtype: int64
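
The counts above show a clear class imbalance, which matters when reading the accuracy figures later on. As a rough sketch, the arithmetic below uses the two counts copied from that output to work out the spam share and the accuracy of a trivial classifier that always answers "ham"; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
# Accuracy of a classifier that labels every message as ham.
majority_baseline = ham_count / total

print(f"Total messages: {total}")
print(f"Spam share: {spam_share:.3f}")
print(f"Always-ham baseline accuracy: {majority_baseline:.3f}")
```

Any model worth keeping has to beat that baseline, so precision and recall on the spam class are more informative than accuracy alone.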

In [5]:
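# Train/test split: 80% train, 20% test; stratify=y keeps the ham/spam ratio the same in both sets.
# No random_state is set, so the split (and every score below) changes from run to run.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8, stratify=y)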
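
To see what `stratify` does, here is a small sketch on made-up labels (80 ham, 20 spam). The toy data and the `random_state=0` seed are assumptions added so the example is repeatable; the cell above does not fix a seed.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham and 20 spam placeholders.
labels = ["ham"] * 80 + ["spam"] * 20
messages = [f"message {i}" for i in range(len(labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels, train_size=0.8, stratify=labels, random_state=0
)

# Both subsets keep the original 80/20 ham-to-spam ratio.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Without stratification, a random split of an imbalanced dataset can leave the test set with noticeably more or less spam than the training set.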

In [13]:
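# CountVectorizer to create the document-term matrix (DTM):
# lowercase the text, drop English stop words, and keep only terms found in at least 10 training messages.

vect = CountVectorizer(stop_words='english', lowercase=True, min_df=10)
X_train_dtm = vect.fit_transform(xtrain)  # learn the vocabulary from the training set only
demo = pd.DataFrame(X_train_dtm.toarray())
demo.columns = vect.get_feature_names_out()
demo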

,00,000,03,0800,08000839402,08000930705,10,100,1000,10p,...,ya,yar,yeah,year,years,yes,yesterday,yo,yr,yup
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Transform the test set with the vocabulary learned from the training set (no refitting)
X_test_dtm = vect.transform(xtest)
demotest = pd.DataFrame(X_test_dtm.toarray())
demotest.columns = vect.get_feature_names_out()
demotest

,00,000,03,0800,08000839402,08000930705,10,100,1000,10p,...,ya,yar,yeah,year,years,yes,yesterday,yo,yr,yup
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1110,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1111,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1113,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Fit Multinomial Naive Bayes on the count-based DTM, then predict labels for the test set
nb = MultinomialNB()

nb.fit(X_train_dtm, ytrain)

y_pred_class = nb.predict(X_test_dtm)

print("Number of Features")
print(X_train_dtm.shape[1])
print("Training Accuracy")
print(nb.score(X_train_dtm,ytrain))
print("Testing Accuracy")
print(nb.score(X_test_dtm,ytest))
print("Confusion Matrix")
print(confusion_matrix(ytest,y_pred_class))
print("Classification Report")
print(classification_report(ytest,y_pred_class))

Number of Features
695
Training Accuracy
0.9845187345748261
Testing Accuracy
0.9811659192825112
Confusion Matrix
[[954  12]
 [  9 140]]
Classification Report
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.92      0.94      0.93       149

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115
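
The layout of the confusion matrix is easy to misread, so here is a small sketch on five invented labels. In scikit-learn, rows are the true classes and columns are the predicted classes; the explicit `labels=["ham", "spam"]` argument is added here only to make the order visible (it matches the default sorted order for these two labels).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, invented for illustration only.
y_true = ["ham", "ham", "ham", "spam", "spam"]
y_hat = ["ham", "spam", "ham", "spam", "ham"]

# Rows = true class, columns = predicted class, both in the order ham, spam.
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
print(cm)

# Flattened order: ham kept, ham flagged as spam, spam missed, spam caught.
true_ham, false_spam, missed_spam, caught_spam = cm.ravel().tolist()
```

Read the same way, the matrix printed above means 954 ham messages were kept, 12 ham messages were flagged as spam, 9 spam messages were missed, and 140 spam messages were caught.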



In [16]:
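# TfidfVectorizer to create the DTM: same preprocessing as before,
# but each cell holds a TF-IDF weight instead of a raw count.

vect = TfidfVectorizer(stop_words='english', lowercase=True, min_df=10)
X_train_dtm = vect.fit_transform(xtrain)  # learn vocabulary and IDF weights from the training set only
demo = pd.DataFrame(X_train_dtm.toarray())
demo.columns = vect.get_feature_names_out()
demo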

,00,000,03,0800,08000839402,08000930705,10,100,1000,10p,...,ya,yar,yeah,year,years,yes,yesterday,yo,yr,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.285859,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.373753,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4453,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# Transform the test set with the vocabulary and IDF weights learned from the training set
X_test_dtm = vect.transform(xtest)
demotest = pd.DataFrame(X_test_dtm.toarray())
demotest.columns = vect.get_feature_names_out()
demotest

,00,000,03,0800,08000839402,08000930705,10,100,1000,10p,...,ya,yar,yeah,year,years,yes,yesterday,yo,yr,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.325102,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# Fit Multinomial Naive Bayes on the TF-IDF DTM, then predict labels for the test set
nb = MultinomialNB()

nb.fit(X_train_dtm, ytrain)

y_pred_class = nb.predict(X_test_dtm)

print("Number of Features")
print(X_train_dtm.shape[1])
print("Training Accuracy")
print(nb.score(X_train_dtm,ytrain))
print("Testing Accuracy")
print(nb.score(X_test_dtm,ytest))
print("Confusion Matrix")
print(confusion_matrix(ytest,y_pred_class))
print("Classification Report")
print(classification_report(ytest,y_pred_class))

Number of Features
695
Training Accuracy
0.9820507067534215
Testing Accuracy
0.9757847533632287
Confusion Matrix
[[959   7]
 [ 20 129]]
Classification Report
              precision    recall  f1-score   support

         ham       0.98      0.99      0.99       966
        spam       0.95      0.87      0.91       149

    accuracy                           0.98      1115
   macro avg       0.96      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
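
The two runs can be compared directly from their confusion matrices. The sketch below recomputes spam precision, recall and F1 from the numbers printed above; `spam_metrics` is a hypothetical helper written for this comparison, not part of scikit-learn.

```python
def spam_metrics(matrix):
    """Return (precision, recall, f1) for the spam class.

    `matrix` is laid out as [[ham kept, ham flagged], [spam missed, spam caught]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the two outputs above.
count_matrix = [[954, 12], [9, 140]]
tfidf_matrix = [[959, 7], [20, 129]]

for name, matrix in [("CountVectorizer", count_matrix),
                     ("TfidfVectorizer", tfidf_matrix)]:
    precision, recall, f1 = spam_metrics(matrix)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this particular split, TF-IDF flags fewer ham messages as spam (7 versus 12) but misses more spam (20 versus 9), so raw counts give the higher spam F1. Because the split is not seeded, the exact figures will differ between runs.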

