# Spam E-mail Classifier

## Abstract

The purpose of this project is to create a filter that can determine whether an incoming message is spam. The training data consisted of 4601 rows, each with 58 columns (57 features plus the spam/ham label). The features included word frequency, expressed as a percentage using the equation:

$$ 100 \cdot \frac{\mbox{Number of times word appears}}{\mbox{Total number of words}} $$

We were able to create a model that predicts whether an email is spam or ham with 89.8% accuracy.

## Introduction

Spam emails are familiar to anyone with an email address. They range from pesky advertisements from retailers to insidious phishing emails from hackers. All reputable email services offer some form of spam filter, which classifies emails based on their content and segregates those deemed to be spam. This project seeks to replicate this process using a data set obtained from https://www.kaggle.com/monizearabadgi/spambase. The data is pre-labeled as either ham or spam. To determine whether an email is spam, we look at word frequency and sequences of capital letters. This should be an effective method, as spam emails tend to contain different words than regularly composed emails.
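To make "sequences of capital letters" concrete, the sketch below computes three summary statistics for runs of consecutive capital letters: average length, longest run, and total length. The function name `capital_run_features` and the choice of these three statistics are assumptions made for illustration; the data set's own columns may be defined differently.

```python
import re

def capital_run_features(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs.

    Hypothetical helper: the exact definition used by the data set is assumed.
    """
    # Each match is one uninterrupted run of ASCII capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    total = sum(runs)
    return total / len(runs), max(runs), total

# Runs here are "WIN", "FREE", and "NOW".
avg_run, longest_run, total_run = capital_run_features("WIN a FREE prize NOW")
print(avg_run, longest_run, total_run)
```

A message written mostly in capitals produces long runs and a large total, which is the kind of signal a classifier can pick up alongside word frequency.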

## Research Questions

* Can a reliable spam filter be created by observing word frequency in labeled email data?
* What level of accuracy can a classification model achieve when classifying spam emails?



In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

# Load data from csv
spam_data = pd.read_csv("../Data/spambase.csv")

# Check if there are any missing values for any of the features.
print(spam_data.isnull().any())

# The data was already clean so all I needed to do was split it into features and labels
X = spam_data.iloc[:,:-1] 
y = spam_data.iloc[:,-1] 

make                 False
address              False
all                  False
3d                   False
our                  False
over                 False
remove               False
internet             False
order                False
mail                 False
receive              False
will                 False
people               False
report               False
adresses             False
free                 False
business             False
email                False
you                  False
credit               False
your                 False
font                 False
000                  False
money                False
hp                   False
hpl                  False
george               False
650                  False
lab                  False
labs                 False
telnet               False
857                  False
data                 False
415                  False
85                   False
technology           False
1999                 False
p

In [31]:
# Train an svm on the labeled data to learn about ham and spam emails
svm_classifier = svm.SVC(gamma=0.001, C=100.)
svm_scores = cross_val_score(svm_classifier, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (svm_scores.mean(), svm_scores.std() * 2))

Accuracy: 0.87 (+/- 0.10)


In [32]:
# Train a K-nearest neighbors on the labeled data
knn_classifier = KNeighborsClassifier()
knn_scores = cross_val_score(knn_classifier, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (knn_scores.mean(), knn_scores.std() * 2))

Accuracy: 0.77 (+/- 0.07)


In [33]:
# Train a Decision tree on the labeled data
dtree_classifier = DecisionTreeClassifier(random_state=0)
dtree_scores = cross_val_score(dtree_classifier, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (dtree_scores.mean(), dtree_scores.std() * 2))

Accuracy: 0.89 (+/- 0.10)


## Results

Of the three classification algorithms tested, the decision tree and the support vector machine performed very similarly under 5-fold cross-validation. The decision tree reached a mean accuracy of 89 percent and the support vector machine 87 percent, each with a spread of plus or minus 10 percentage points (two standard deviations across folds). K-nearest neighbors performed the worst, with a mean accuracy of 77 percent and a spread of plus or minus 7 percentage points.
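Accuracy alone does not show whether the errors are missed spam or ham flagged as spam. One way to separate the two is a confusion matrix built from cross-validated predictions. The sketch below is not part of the original analysis: it runs on synthetic stand-in data from `make_classification`, and it assumes labels are encoded as 0 for ham and 1 for spam. With the notebook's own `X` and `y` substituted in, the same calls would apply.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Each sample is predicted by a model that did not see it during training.
predictions = cross_val_predict(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5
)

# Rows are true labels, columns are predicted labels (assumed 0 = ham, 1 = spam).
tn, fp, fn, tp = confusion_matrix(y_demo, predictions, labels=[0, 1]).ravel()
print(f"ham flagged as spam: {fp}, spam missed: {fn}")
```

For a spam filter, the two error types usually carry different costs: a missed spam email is an annoyance, while a legitimate email sent to the spam folder may never be read.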

## Conclusion

Since there is very little difference between the performance of the SVM and the decision tree, we could choose either one to filter spam with similar results. The high level of accuracy achieved by both methods suggests that the data provided was sufficient and the method of learning was appropriate.

## Limitations and Future Work

One major limitation of this work is that all of the emails were delivered to and labeled by a single person. To remedy this, we could use emails collected and labeled by multiple people to create a more accurate and precise spam filter. To improve the algorithm, we could integrate it with an email app so that any time a message is labeled as spam or not spam, the dataset is updated. This would require an algorithm that converts an email message into its word frequencies. Another future improvement would be to determine which sorts of emails the model is mislabeling. By knowing whether it is missing spam emails or inappropriately labeling ham as spam, we could tweak hyper-parameters to improve its accuracy.
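As a rough illustration of converting a message into word frequencies, the sketch below applies the percentage formula from the abstract to a raw message. The function name `word_frequencies`, the tokenization rule (lowercased runs of letters and digits), and the three-word vocabulary are all assumptions made for the example; a real integration would need to match whatever rule produced the original data set.

```python
import re
from collections import Counter

def word_frequencies(message, vocabulary):
    """Return 100 * (occurrences of word) / (total words) for each vocabulary word.

    Hypothetical helper: the tokenization rule is an assumption.
    """
    # Treat every lowercased run of letters or digits as one word.
    words = re.findall(r"[a-z0-9]+", message.lower())
    total = len(words)
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    counts = Counter(words)
    return {word: 100 * counts[word] / total for word in vocabulary}

# Illustrative vocabulary; the real feature list comes from the data set columns.
vocabulary = ["free", "money", "george"]
features = word_frequencies("Free money! Claim your free prize now", vocabulary)
print(features)
```

The resulting dictionary could be ordered to match the training columns and passed to a fitted classifier, provided the remaining features are computed in the same way.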


