# Spam Mail Classification Using Machine Learning

## Introduction

With the rise of the internet and easy access to computing devices, spam mail has become extremely common. In most cases it is just another unwanted notification, but some spam poses a security risk. Scammers often send emails impersonating other entities in order to gain access to users' personal information (bank account details, addresses, and Social Security numbers, to name a few).

People who are not tech-savvy may fall prey to these techniques, which makes a spam mail filter crucial for protecting such users.

In [36]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer as tf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Data Preprocessing and Exploratory Data Analysis

In [17]:
# Loading the data
mail = pd.read_csv("mail_data.csv")

# Let's see what the first 5 rows look like
print(mail.head())

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [18]:
# Let's clean the data by replacing null values with empty strings
clean_mail = mail.where(pd.notnull(mail), '')
print(clean_mail.head())

# Checking dimensions of dataset
clean_mail.shape

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


(5572, 2)
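
The null-handling step above can be checked on a small example. This is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not part of `mail_data.csv`); it suggests that `where(pd.notnull(...), '')` leaves the shape unchanged and only swaps missing cells for empty strings, which is also what `fillna('')` does.

```python
import pandas as pd

# Hypothetical toy frame with one missing message (not the real dataset)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["hello", None, "see you"],
})

# Count missing cells before cleaning
missing_before = int(toy.isnull().sum().sum())

# Same pattern as the notebook: keep non-null cells, replace the rest with ''
via_where = toy.where(pd.notnull(toy), "")

# A shorter spelling of the same replacement
via_fillna = toy.fillna("")

print(missing_before)
print(via_where["Message"].tolist())
print(via_where.shape == toy.shape)
```

Counting nulls first, as done here with `isnull().sum()`, is a cheap way to see whether the cleaning step changes anything at all.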

In [28]:
# Replacing spam and ham with numerical values (spam = 0, ham = 1)
clean_mail.loc[clean_mail["Category"] == "spam", "Category"] = 0
clean_mail.loc[clean_mail["Category"] == "ham", "Category"] = 1

In [31]:
# Separating label from message
X = clean_mail["Message"]
y = clean_mail["Category"]
print(X, y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object 0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


In [33]:
# Creating a training and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)

print(X_train.shape, X_test.shape)

(4457,) (1115,)
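
The `stratify = y` argument asks `train_test_split` to keep the class proportions the same in both splits. The sketch below illustrates this on synthetic labels (the 20/80 mix and the `labels`/`messages` names are assumptions chosen for illustration, not the proportions of the real dataset).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels: 20 spam (0) and 80 ham (1), purely for illustration
labels = pd.Series([0] * 20 + [1] * 80)
messages = pd.Series([f"message {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)

# Share of ham (label 1) in each split; stratification keeps them aligned
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, a small test set could end up with noticeably more or less spam than the training set, which makes the test score harder to interpret.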


In [51]:
# Feature extraction: converting text into numerical TF-IDF values
transformed = tf(min_df = 1, stop_words = "english", lowercase = True)

# min_df = 1 keeps every term that appears in at least one message,
# so even a word that occurs only once still gets a score (this is the default)

# Fitting the vectorizer on the training messages only, then applying it to both splits
X_train_transformed = transformed.fit_transform(X_train)
X_test_transformed = transformed.transform(X_test)

# Converting y_train and y_test to integers
y_train = y_train.astype("int")
y_test = y_test.astype("int")

print(X_train_transformed)

  (0, 1713)	0.7071067811865476
  (0, 2392)	0.7071067811865476
  (1, 5241)	0.4375307310285687
  (1, 6959)	0.28353533329007485
  (1, 7160)	0.5436306308668579
  (1, 6539)	0.5043342820364712
  (1, 7400)	0.4222407409615414
  (2, 5832)	0.44558977651701576
  (2, 4721)	0.47479780441738884
  (2, 5801)	0.4979815204239224
  (2, 2982)	0.4979815204239224
  (2, 7421)	0.28292332285709515
  (3, 5831)	1.0
  (4, 4738)	0.3204169987043922
  (4, 1583)	0.3549147356597047
  (4, 4157)	0.5534651020906459
  (4, 2788)	0.5342906086326665
  (4, 3177)	0.4237669213702235
  (5, 4009)	0.34391007706121646
  (5, 7131)	0.2506764475916244
  (5, 2297)	0.3741084487077238
  (5, 5720)	0.4810359172341316
  (5, 6415)	0.30653767941680865
  (5, 1757)	0.5045242316078323
  (5, 7105)	0.3146814949645023
  :	:
  (4454, 4185)	0.2922127971401536
  (4454, 4416)	0.2885854848504585
  (4454, 3153)	0.41710096601362584
  (4454, 4103)	0.41710096601362584
  (4455, 6123)	0.5957506815947425
  (4455, 3887)	0.6142237248000665
  (4455, 4016)	0.51750
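
To make the effect of `min_df` concrete, here is a small sketch on a made-up three-message corpus (`toy_corpus` is an assumption, not taken from the dataset). It compares the vocabulary kept with `min_df=1` against `min_df=2`, where a term must appear in at least two messages to be retained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus for illustration only
toy_corpus = [
    "free prize claim now",
    "claim your free entry",
    "see you at lunch",
]

# min_df=1: every term seen in at least one document is kept
keep_all = TfidfVectorizer(min_df=1)
keep_all.fit(toy_corpus)

# min_df=2: only terms seen in at least two documents are kept
keep_repeated = TfidfVectorizer(min_df=2)
keep_repeated.fit(toy_corpus)

vocab_all = sorted(keep_all.vocabulary_)
vocab_repeated = sorted(keep_repeated.vocabulary_)
print(len(vocab_all), vocab_repeated)
```

Raising `min_df` shrinks the feature space by dropping rare terms; whether that helps on the real data would need to be checked against the test score.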

## The Machine Learning Model

In [63]:
# Initialize logistic regression
logreg = LogisticRegression()

# Fit the model on the training values
logreg.fit(X_train_transformed, y_train)

# Evaluating the training accuracy of the model
training_predict = logreg.predict(X_train_transformed)
training_acc = accuracy_score(y_train, training_predict)
print("The training accuracy is " + str(training_acc*100))

The training accuracy is 96.85887368184878


In [65]:
# Evaluating test accuracy
test_predict = logreg.predict(X_test_transformed)
test_acc = accuracy_score(y_test, test_predict)
print("The test accuracy is " + str(test_acc*100))

The test accuracy is 95.96412556053812
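
Accuracy alone can hide how well the spam class is detected, especially if one class is much rarer than the other. The following is a sketch on hand-written labels (the `y_true` and `y_pred` lists are invented for illustration and are not the model's predictions); it uses the notebook's encoding of spam = 0 and ham = 1.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels: 2 spam (0) and 8 ham (1); one spam message is missed
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Fraction of actual spam messages that were flagged as spam
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(acc)
print(cm)
print(spam_recall)
```

Here the accuracy is high even though half of the spam slips through, so applying `confusion_matrix` or `recall_score` to `y_test` and `test_predict` may give a fuller picture than accuracy by itself.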


In [83]:
# Testing out the model (avoid naming the variable `input`, which shadows the built-in)
input_mail = ["URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18"]

# Convert text into numbers
input_transformed = transformed.transform(input_mail)

# Predicting outcome (predict returns an array, so take the first element)
prediction = logreg.predict(input_transformed)

if prediction[0] == 1:
    print("This is a ham mail")
else:
    print("This is a spam mail")

This is a spam mail
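
The vectorize-then-predict steps above can also be bundled so that raw text goes in and a label comes out. This is a minimal sketch using scikit-learn's `make_pipeline`; the six training messages and the `spam_filter` name are made up for illustration, so the toy model's predictions say nothing about the real classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (spam = 0, ham = 1, as in the notebook)
toy_messages = [
    "WIN a FREE prize now, txt CLAIM",
    "URGENT free entry, claim your prize",
    "Free membership, txt WIN to claim",
    "Are you coming home for dinner",
    "Ok see you at lunch tomorrow",
    "Call me when the meeting ends",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object
spam_filter = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_filter.fit(toy_messages, toy_labels)

# Raw text can be passed directly; the pipeline handles the transform step
prediction = spam_filter.predict(["Claim your free prize now"])[0]
print("This is a ham mail" if prediction == 1 else "This is a spam mail")
```

Keeping both steps in one object removes the risk of forgetting to transform new text, or of transforming it with a different vectorizer than the one used in training.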
