# Is this email spam?

Many email services today provide spam filters that classify emails as spam or non-spam with high accuracy. In this article I would like to cover a simple approach to detecting whether an email is spam by building my own spam filter from scratch!

The first time I heard about this problem, I wondered how it was possible to achieve this with a dataset made of nothing but text (the emails). Machine learning algorithms need to be fed numbers; they do not understand strings. So we need a method to transform the dataset, which is basically a sequence of strings, into a sequence of numbers.<br>
Let's take a look at one possible idea.

## Idea

One possible approach to adopt is the following:

  
  1. the first thing we need is a dataset containing the most frequently occurring spam words (here we call it **SD**, Spam Dataset); <br><br>For each email within our dataset we do the following steps: <br><br>
  2. create a list of integers, all set to $0$, with the same length as *SD* (here we call this list **EL**, Email List);
    
  3. next, to fill the list of integers properly, we use the following idea:<br>
        if the first spam word contained in *SD* is present in the email under consideration, then we assign the value $1$ to the first cell of *EL*. <br>We continue in the same way for all the spam words in *SD*.
    
Doing so, we end up with a list of ones and zeros for each email.

An example may help if this is still a bit confusing.

Consider the following email:
    <center>"<i>Artificial   intelligence   will   destroy   humanity!</i>"</center>
    
Let's say our *SD* is composed of the following spam words (step 1):
- artificial
- dollars
- cost
- less
- destroy
- money
- free

Consequently our *EL* would be the following list (step 2):
$$[0, 0, 0, 0, 0, 0, 0]$$

After the third step, we end up with this list: 
$$[1, 0, 0, 0, 1, 0, 0]$$

If we repeat this for each email within our dataset, we end up with a dataset of numbers, and we finally have something to pass to a learning algorithm.
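
Before the full implementation, the three steps can be sketched on the toy example in a few lines of plain Python. This is only an illustration: the helper name `to_indicator_list` is made up for this sketch, and the tokenisation (lower-casing and keeping runs of letters) is a simplifying assumption, not the preprocessing used later in the article.

```python
import re

# Step 1: the toy Spam Dataset (SD) from the example above
spam_words = ["artificial", "dollars", "cost", "less", "destroy", "money", "free"]

email = "Artificial intelligence will destroy humanity!"


def to_indicator_list(text, vocabulary):
    """Return the Email List (EL): 1 where the spam word occurs in the text, else 0."""
    # Assumption: words are runs of letters, compared case-insensitively
    words_in_email = set(re.findall(r"[a-z]+", text.lower()))
    # Steps 2 and 3: one cell per spam word, set to 1 if the word is present
    return [1 if word in words_in_email else 0 for word in vocabulary]


print(to_indicator_list(email, spam_words))  # [1, 0, 0, 0, 1, 0, 0]
```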

Let's implement it!

First, we load the `emails.csv` dataset and take a look at the percentage of spam and non-spam emails.

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 17px; }</style>"))

df = pd.read_csv('emails.csv')
#df['spam'].value_counts().plot(kind = 'pie', explode = [0, 0.1], figsize = (6, 6), autopct = '%1.1f%%', shadow = True)
#plt.ylabel("Spam vs not spam")
#plt.legend(["Not spam", "Spam"])
#plt.show()

<img src="article/spam-email/img/1.png" alt="" style="width: 1000px;"/>

To convert each email into a numeric representation, I have created the following utility functions. The most interesting one is `process_email`, which performs several useful string operations. For example, if the email we are processing contains numbers, they are replaced with the word *number* (the `replace any number` step in the code). <br>Moreover, emails often contain HTML tags. Since these tags can be tricky to manage, we simply get rid of them (the `strip all HTML` step).

### Functions

In [1]:
import re
import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer

# reads the fixed vocabulary list in vocab.txt and returns the words as a Python list
def get_vocabList():
    vocab = pd.read_table('vocab.txt', sep=r'\s+', header=None)
    vocab = vocab.iloc[:,-1].values.tolist()
    return vocab
       

# preprocesses the body of an email and returns a list of indices of the words contained in the email
def process_email(email, is_file):
    
    # read the email from a file if `is_file` is True; otherwise `email` is already the text itself
    if is_file:
        with open(email, 'r', encoding='ISO-8859-1') as myfile:
            processed_email = myfile.read()
            original_email = processed_email
    else:
        processed_email = email
        original_email = email
            
    vocabList = get_vocabList()
    
    
    # convert all characters to lower case
    processed_email = str(processed_email).lower()

    # strip all HTML tags
    processed_email = re.sub(r'<[^<]+?>', '', processed_email)

    # replace any number with the string 'number'
    processed_email = re.sub(r'[0-9]+', 'number ', processed_email)

    # replace strings starting with http:// or https:// with the string 'httpaddr'
    processed_email = re.sub(r'(http|https)://[^\s]*', 'httpaddr ', processed_email)

    # replace email addresses with the string 'emailaddr'
    processed_email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr ', processed_email)

    # replace dollar signs ($) with the string 'dollar'
    processed_email = re.sub(r'[$]+', 'dollar ', processed_email)

    # replace non-alphanumeric characters with spaces
    processed_email = re.sub(r'[^a-zA-Z0-9]', ' ', processed_email)
    
    word_indices = np.array([])
    # create the stemmer once instead of once per word
    ps = PorterStemmer()
    for word in processed_email.split():
        # stem the word
        word = ps.stem(word)
        
        # if the word is in vocabList, insert in word_indices array the word's index of vocabList 
        for index in range(len(vocabList)):
            if vocabList[index] == word:
                word_indices = np.append(word_indices, index)
    return word_indices, original_email, vocabList, processed_email


# takes in a word_indices vector and produces a feature vector from the word indices
def email_feature(word_indices, vocabList):
    x = np.zeros(len(vocabList))
    for index in range(len(word_indices)):
        x[int(word_indices[index])] = 1
    return x
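
To see what the normalisation steps inside `process_email` do, here is a standalone sketch that applies the same regular expressions, in the same order, to a made-up string. The function name `normalize` and the sample text are assumptions for illustration only; stemming and the vocabulary lookup are left out so that the sketch runs without NLTK or `vocab.txt`.

```python
import re


def normalize(text):
    """Apply the regex steps of process_email and return the resulting tokens."""
    text = text.lower()
    text = re.sub(r'<[^<]+?>', '', text)                         # strip HTML tags
    text = re.sub(r'[0-9]+', 'number ', text)                    # numbers -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr ', text)   # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr ', text)          # addresses -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar ', text)                      # '$' -> 'dollar'
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)                    # drop remaining punctuation
    return text.split()


sample = "<b>Win $100 now!</b> Visit http://example.com or mail me@example.com"
print(normalize(sample))
# ['win', 'dollar', 'number', 'now', 'visit', 'httpaddr', 'or', 'mail', 'emailaddr']
```

Mapping every number, URL and address to a single placeholder token means the classifier sees "this email contains a link" rather than one rarely repeated link.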

Let's take a look at some of the spam words contained in the Spam Dataset:

In [14]:
from wordcloud import WordCloud

#spam_wordcloud = WordCloud(width=600, height=400).generate(" ".join(vocab))
#plt.figure( figsize=(10,8), facecolor='k')
#plt.imshow(spam_wordcloud)
#plt.axis("off")
#plt.tight_layout(pad=0)
#plt.show()

<img src="article/spam-email/img/2.png" alt="" style="width: 1000px;"/>

### Convert emails into their numeric representation 

Now we apply the idea described earlier to convert the emails into their numeric representation.

In [16]:
emails = pd.read_csv('emails.csv')
vocab = get_vocabList()
X = emails.iloc[:, 0]
y = emails.iloc[:, 1]
X_rec = np.zeros((X.shape[0], len(vocab)))

# process each mail separately
for m in range(len(X)):
    word_indices, original_email, vocabList, processed_email = process_email(X[m], is_file=False)
    feature = email_feature(word_indices, vocabList)
    X_rec[m, :] = feature
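
The conversion above compares every word of every email against the whole vocabulary inside `process_email`, which can be slow on a few thousand emails. One possible speed-up, shown here as a sketch and not as the method used in this article, is to look each word up in a dictionary instead. The names `build_index` and `featurize_all` and the toy data are hypothetical, and the emails are assumed to be already tokenised and stemmed.

```python
def build_index(vocabulary):
    """Map each word to its position; assumes the vocabulary has no duplicates."""
    return {word: position for position, word in enumerate(vocabulary)}


def featurize_all(tokenized_emails, vocabulary):
    """Return one 0/1 row per email, with one column per vocabulary word."""
    index = build_index(vocabulary)
    rows = []
    for tokens in tokenized_emails:
        row = [0] * len(vocabulary)
        for token in tokens:
            position = index.get(token)  # None when the word is not in the vocabulary
            if position is not None:
                row[position] = 1
        rows.append(row)
    return rows


toy_vocab = ["cost", "dollar", "free", "number"]
toy_emails = [["win", "dollar", "number", "now"], ["hello", "friend"], ["free", "free", "cost"]]
print(featurize_all(toy_emails, toy_vocab))
# [[0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
```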



Done! Training time has come! <br>I have decided to use two learning algorithms:
- Logistic regression
- Neural network

## Training using Logistic regression model

To train and test the model, I split the dataset into a *training set* and a *test set* (70 - 30). Strictly speaking, splitting the dataset into just two chunks (train + test) is bad practice: it should be split into three different sets (train + validation + test). For the purpose of this article, though, train + test is fine since we are not going to tune any model parameters.

In [24]:
testset_size = 1718
X_rec_train = X_rec[:(X_rec.shape[0] - testset_size)]
y_train = y[:(y.shape[0] - testset_size)]

X_rec_test = X_rec[(X_rec.shape[0] - testset_size):]
y_test = y[(X_rec.shape[0] - testset_size):]
X_rec_train.shape, y_train.shape, X_rec_test.shape, y_test.shape

((4010, 1899), (4010,), (1718, 1899), (1718,))
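
For comparison, a three-way split could look like the sketch below. It is not used anywhere in this article; the function names and the 15% / 15% fractions are assumptions chosen so that the training set keeps the same 4010 emails. Like the cell above, it slices in file order and does not shuffle.

```python
def split_sizes(n_samples, val_fraction, test_fraction):
    """Return (n_train, n_val, n_test); the training set takes whatever is left."""
    n_val = int(n_samples * val_fraction)
    n_test = int(n_samples * test_fraction)
    return n_samples - n_val - n_test, n_val, n_test


def three_way_split(rows, val_fraction=0.15, test_fraction=0.15):
    """Slice rows into consecutive train, validation and test chunks."""
    n_train, n_val, _ = split_sizes(len(rows), val_fraction, test_fraction)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test


rows = list(range(5728))  # stand-in for the 5728 emails (4010 + 1718)
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))  # 4010 859 859
```

The validation chunk would be used to compare model settings, leaving the test chunk untouched until the very end.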

In [27]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs')
m = clf.fit(X_rec_train, y_train)

### Testing model using test set

In [28]:
print("Accuracy: {0:.2f}".format((m.predict(X_rec_test) == y_test).sum()/X_rec_test.shape[0]*100), "%")

Accuracy: 98.72 %


An accuracy of 98.72% is impressive, especially since we did not tune any parameters! Let's see if a neural network can do better.
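
The accuracy reported here is simply the fraction of test emails whose predicted label matches the true one. Here is a minimal sketch on made-up labels (not taken from `emails.csv`), assuming `1` means spam and `0` means not spam. The helper `error_counts` is a hypothetical addition that counts the two kinds of mistake separately, since flagging a legitimate email is usually more annoying than missing a spam one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def error_counts(y_true, y_pred):
    """Return (false positives, false negatives), with 1 = spam and 0 = not spam."""
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_positives, false_negatives


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: {0:.2f} %".format(accuracy(y_true, y_pred) * 100))  # Accuracy: 75.00 %
print(error_counts(y_true, y_pred))  # (1, 1)
```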

## Training using Neural Network model

In [29]:
from keras.models import Sequential
from keras.layers import Dense
import numpy

# fix random seed for reproducibility
numpy.random.seed(7)
emails = pd.read_csv('emails.csv')
testset_size = 1500
X_rec_train = X_rec[:(X_rec.shape[0] - testset_size)]
y_train = y[:(y.shape[0] - testset_size)]

X_rec_test = X_rec[(X_rec.shape[0] - testset_size):]
y_test = y[(X_rec.shape[0] - testset_size):]

# create model
model = Sequential()
model.add(Dense(12, input_dim=1899, activation='relu'))
model.add(Dense(1899, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Using TensorFlow backend.


In [30]:
model.fit(X_rec_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ffb65ad0390>

### Testing model using test set 

In [31]:
scores = model.evaluate(X_rec_test, y_test)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))


acc: 98.53%


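The neural network also does a good job, with an accuracy of 98.53%. It could probably do better with tuned hyperparameters and a better architecture. Keep in mind that the two models were evaluated on test sets of different sizes (1718 and 1500 emails), so the two figures are not strictly comparable.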

## Testing the spam classifier using my emails

**Now the cool part!** <br>
Here we test the spam classifier we have just built on our own emails! <br>
(If you want to test the model with your own email, just paste it into the file `my_email.txt`.)

In [45]:
my_email = 'my_email.txt'
word_indices, original_email, vocabList, processed_email = process_email(my_email, is_file=True)
X = email_feature(word_indices, vocabList)



The email I want to test is a spam email I recently received. Let's see if the model is tough enough to recognize it! <br>
The email is the following:

In [42]:
original_email

'How can you make time to earn your MBA?\nStart with choosing a program that is as flexible as it is valuable, like the iMBA.\nPatricia, a full-time working mother and iMBA student...\n"Getting my MBA makes me feel empowered. I donâ\x80\x99t need to stop working, I donâ\x80\x99t need to stop being a mother, I donâ\x80\x99t need to stop having my life, and that is everything to me.â\x80\x9d\nContinue Reading\nUPCOMING WEBINAR\n\nThursday, June 20\n\nThis webinar will provide an overview of the iMBA program, student experience and interactive curriculum. It will also discuss admission requirements, and participants will have an opportunity to ask questions for the admissions panel.\nRSVP Now\n'

In [43]:
with open(my_email, 'r', encoding='ISO-8859-1') as myfile:
    processed_email = myfile.read()

In [46]:
print("This email is: ", np.where(np.round(model.predict(X.reshape(1, -1))) == 1, "Spam", "Not spam")[0][0])

This email is:  Spam


The model correctly recognizes it as spam. Superb!
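
A note on how the label is obtained: the network's last layer is a sigmoid, so `model.predict` returns a number between 0 and 1, and rounding it amounts to a cut-off at 0.5. The sketch below spells that out with a hypothetical helper, `label_from_probability`, and made-up probabilities; raising the threshold would be one way to make the filter more cautious about flagging legitimate emails.

```python
def label_from_probability(probability, threshold=0.5):
    """Turn a predicted spam probability into a label."""
    # Strictly greater than, mirroring np.round, which rounds exactly 0.5 down to 0
    return "Spam" if probability > threshold else "Not spam"


# Made-up probabilities, for illustration only
for p in (0.03, 0.5, 0.97):
    print(p, "->", label_from_probability(p))
```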