# Spam Mail Detection using SVM

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression, although it is most commonly applied to classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional feature space that separates the classes, ideally with the largest possible margin. The dimension of the hyperplane depends on the number of features: with N features, the hyperplane has N-1 dimensions.

Linear SVM is used for linearly separable data. A two-class dataset is linearly separable if a single straight line (or, in higher dimensions, a single hyperplane) can separate the two classes. The classifier used for such data is called a Linear SVM classifier.
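To make the idea of a separating hyperplane concrete, here is a minimal sketch of the decision rule a linear classifier applies: a point is assigned to one class or the other depending on the sign of `w . x + b`. The weight vector `w`, the bias `b`, and the sample points are made-up illustrative values, not parameters learned from the spam dataset.

```python
# Minimal sketch of a linear decision rule: the sign of (w . x + b).
# The weights, bias and points are hypothetical, chosen only for illustration.

def decision_function(w, b, x):
    """Signed score of point x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return 1 for points on the non-negative side of the hyperplane, else 0."""
    return 1 if decision_function(w, b, x) >= 0 else 0

w = [1.0, -1.0]  # hypothetical weight vector (normal to the hyperplane)
b = 0.0          # hypothetical bias

points = [(3.0, 1.0), (1.0, 3.0), (2.0, 0.5)]
labels = [predict(w, b, p) for p in points]
print(labels)  # [1, 0, 1]
```

Training an SVM amounts to choosing `w` and `b` so that the hyperplane separates the training classes with as wide a margin as possible.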

A mail can be either ham (legitimate) or spam. Because there are only two classes, this is a binary classification problem, so we can try a Linear SVM for predicting spam mails.

In [18]:
#import required libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

1. Load the Dataset

In [19]:
#load dataset into pandas DataFrame
raw_data = pd.read_csv('spamham.csv',encoding='ISO-8859-1')

2. Pre-process the dataset

In [20]:
#replace null values with empty strings
mail_data = raw_data.where((pd.notnull(raw_data)),'')

mail_data.shape

(5572, 2)

In [21]:
print("Display first 5 rows of DataFrame:\n")

mail_data.head()

Display first 5 rows of DataFrame:



Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [23]:
#encode spam mail as 0 and ham (non-spam) mail as 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

In [24]:
#separate text/message into X and label into Y
X = mail_data['Message']
Y = mail_data['Category']

print(X)
print('\n\n------------------------------------------------------------------------------\n\n')
print(Y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


------------------------------------------------------------------------------


0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


3. Splitting the dataset into train and test data

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)
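Spam datasets are often imbalanced, and a purely random split can leave the test set with a different spam share than the training set. The sketch below illustrates the idea of a stratified split in plain Python on toy data. The helper `stratified_split` and the toy labels are hypothetical, written only for illustration; in scikit-learn the same effect is usually obtained by passing `stratify=Y` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_size=0.2, seed=2):
    """Split items so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(items[i], labels[i]) for i in sorted(train_idx)]
    test = [(items[i], labels[i]) for i in sorted(test_idx)]
    return train, test

# Toy data (made up): 10 spam (0) and 40 ham (1) messages.
toy_labels = [0] * 10 + [1] * 40
toy_items = [f"message {i}" for i in range(50)]

train, test = stratified_split(toy_items, toy_labels)
spam_in_test = sum(1 for _, y in test if y == 0)
print(len(train), len(test), spam_in_test)
```

With a 20% test share, the toy test set receives 2 of the 10 spam messages and 8 of the 40 ham messages, so both parts keep the original 1:4 ratio.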

4. Extracting features

In [26]:
#transform text data into feature vectors to train the SVM model using TfidfVectorizer
#the vectorizer is fitted on the training data only and then reused to transform the test data
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

#convert Y_train and Y_test to integer dtype (the labels are still stored as object)
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
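To show what a TF-IDF feature vector contains, here is a hand-rolled sketch that follows the weighting `TfidfVectorizer` uses by default: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. The whitespace tokenisation, the helper name `tfidf_vectors`, and the three toy messages are simplifying assumptions; the real vectorizer also applies its own token pattern and the English stop-word list.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Toy messages, made up for illustration.
docs = ["win cash now", "call me now", "win a free prize"]
vocab, vectors = tfidf_vectors(docs)

for term, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first message, "cash" appears in only one document and therefore gets a larger weight than "win" or "now", which each appear in two.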

5. Build and train the model using SVM

In [27]:
model = LinearSVC() #creating the model
model.fit(X_train_features, Y_train) #training the model on train data

LinearSVC()

6. Evaluate the model

In [28]:
#checking accuracy by predicting training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print("Accuracy on training data: ",accuracy_on_training_data)

Accuracy on training data:  0.9997756338344178


In [29]:
#checking accuracy on test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print("Accuracy on test data: ",accuracy_on_test_data)

Accuracy on test data:  0.9775784753363229
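Accuracy alone can hide how well the minority class is handled, so it may be worth looking at precision and recall for the spam class as well. The sketch below computes both from scratch, using the label convention of this notebook (spam = 0, ham = 1). The helper `spam_metrics` and the toy label lists are hypothetical and are not results from the model above; scikit-learn offers `precision_score` and `recall_score` in `sklearn.metrics` for the same purpose.

```python
def spam_metrics(y_true, y_pred, spam_label=0):
    """Return (precision, recall) for the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == spam_label and p == spam_label)
    fp = sum(1 for t, p in pairs if t != spam_label and p == spam_label)
    fn = sum(1 for t, p in pairs if t == spam_label and p != spam_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (made up): 3 spam and 7 ham messages.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall = spam_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy is 0.80, yet one of the three spam messages is missed and one ham message is flagged as spam, which is what precision and recall of about 0.67 reveal.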


7. Detecting whether mail is spam or ham

In [34]:
input_mail = ['SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info']
#convert text to feature vector 
input_mail_features = feature_extraction.transform(input_mail)

#predicting
prediction = model.predict(input_mail_features)
#predict returns a NumPy array with one element: [0] if the mail is spam, [1] if the mail is ham

if prediction[0] == 0:
    print("SPAM MAIL!!")
else:
    print("HAM MAIL!!")

SPAM MAIL!!


Our Model is ready!!

# Solving the same problem using KNN

In [42]:
#importing KNN
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier()
KNN.fit(X_train_features, Y_train)

KNeighborsClassifier()

In [43]:
#checking accuracy
prediction_on_data = KNN.predict(X_test_features)
accuracy = accuracy_score(Y_test, prediction_on_data)
print("KNN accuracy: ", accuracy)

KNN accuracy:  0.9013452914798207


In [44]:
input_mail = ['SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info']
#convert text to feature vector 
input_mail_features = feature_extraction.transform(input_mail)

#predicting
prediction = KNN.predict(input_mail_features)
#predict returns a NumPy array with one element: [0] if the mail is spam, [1] if the mail is ham

if prediction[0] == 0:
    print("SPAM MAIL!!")
else:
    print("HAM MAIL!!")

HAM MAIL!!


The prediction is incorrect: KNN labels this spam message as ham. In addition, the test accuracy with KNN is only 0.9013452914798207, whereas with SVM it is 0.9775784753363229.

So, for this dataset, SVM is preferable.
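For reference, KNN does not learn a separating hyperplane. It stores the training vectors and labels a new message by majority vote among its nearest neighbours. The sketch below illustrates that mechanism in plain Python. The choice of cosine similarity, `k=3`, the helper names, and the toy vectors are all assumptions made for illustration; `KNeighborsClassifier()` as used above defaults to 5 neighbours and Euclidean distance, which ranks neighbours the same way as cosine similarity when the vectors are L2-normalised.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k most similar neighbours."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine_similarity(train_vectors[i], query),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors (made up): label 0 = spam, label 1 = ham.
train_vectors = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.9, 1.0, 0.0, 0.0],
]
train_labels = [0, 0, 1, 1, 0]

spam_like = knn_predict(train_vectors, train_labels, [1.0, 0.8, 0.1, 0.0])
ham_like = knn_predict(train_vectors, train_labels, [0.0, 0.1, 1.0, 0.9])
print(spam_like, ham_like)
```

Because every prediction depends on a handful of nearby training messages, the result is sensitive to the choice of `k` and to how well distances in the sparse TF-IDF space reflect similarity in meaning.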