# Spam Detection

## Dataset Description

### 1. Description

This dataset comes from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/spambase). It consists in 4,601 sms messages that are described as either spam or ham. Each message is characterized by 57 continuous features. The 58th is the classification as spam or ham. There are 1813 (39.4%) messages classified as spam and 2788 (60.6%) as non-spam.

The first 48 columns represent the frequency of specific words. Values are real and their range spans from 0 to 100.
The following 6 columns depict the frequency of specific character. Values are also real and have the same span than the before ones.
The last 3 columns display some statistics about the capital letters. Values are real and their range spans from 0 to (+)infinity.

More information on this dataset can be found in files "spambase.DOCUMENTATION" and "spambase.names".

### 2. Importing libraries

In [1]:
import pandas as pd
import numpy as np

### 3. Importing the dataset

In [2]:
# Importing the dataset
data = pd.read_csv('spambase.data').as_matrix()

## Naive Bayes

First of all, a naive Bayes model is used to classify the sms messages. This model will be useful to assess the performance of future ones.

### Without K-fold

#### 1. Importing libraries

The library MultinomialNB from sklearn is used to run a Naive Bayes classification.
It accepts number values as input and not text data, which is ok as our data consists in 57 features of continuous values.

In [4]:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np

#### 2. Splitting into training and testing sets

The dataset is split into training and testing sets with a 80-20 ratio.

In [5]:
# Training and testing sets
from sklearn.model_selection import train_test_split
Xtrain,Xtest,Ytrain,Ytest = train_test_split(data[:,0:56], data[:,57], test_size = 0.2)

#### 3. Multinomial Naive Bayes

In [6]:
# Multinomial Naive Bayes
model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("Classification rate for NB:", model.score(Xtest, Ytest))

Classification rate for NB: 0.842391304348


The result of one run of multinomial naive Bayes already depicts good result.
To be sure it is not a lucky shot, a K-fold multinomial naive Bayes is performed just after.

### With K-fold

In [13]:
# K-fold
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
score_NB = 0
for train_index, test_index in kf.split(data):
    Xtrain, Xtest = data[train_index,0:56], data[test_index,0:56]
    Ytrain, Ytest = data[train_index,57], data[test_index,57]
    
    # Multinomial Naive Bayes
    model = MultinomialNB()
    model.fit(Xtrain, Ytrain)
    score = model.score(Xtest, Ytest)
    score_NB += score
    print("Classification rate for NB:", score)
score_NB /= kf.get_n_splits()
print("The overall mean rate is:",score_NB)

Classification rate for NB: 0.865217391304
Classification rate for NB: 0.839130434783
Classification rate for NB: 0.823913043478
Classification rate for NB: 0.810869565217
Classification rate for NB: 0.85652173913
Classification rate for NB: 0.867391304348
Classification rate for NB: 0.832608695652
Classification rate for NB: 0.839130434783
Classification rate for NB: 0.819565217391
Classification rate for NB: 0.830434782609
The overall mean rate is: 0.83847826087


The overall mean score is quite close to the one without the K-fold, which proves that results of multinomial naive Bayes are quite resilient for this dataset.

## Artificial Neural Network

An artificial neural network (ANN) is then run to classify the sms messages. It is known that convolutional neural networks (CNN) can usually be used to classify text data. However, the data collected in this case consists in numeric values. Therefore, an ANN is used.

### Data Pre-processing

#### 1. Importing libraries

In [14]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### 2. Importing the dataset

In [15]:
# Importing the dataset
data = pd.read_csv('spambase.data')
X = data.iloc[:, 0:57].values
Y = data.iloc[:, 57].values

#### 3. Splitting data into training and testing sets

The dataset is split into training and testing sets with a 80-20 ratio.

In [16]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size = 0.2)

#### 4. Feature scaling

Feature scaling is used to scale all the features so that none is more important than another in the naive Bayes model.

In [17]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

### Artificial Neural Network

The Keras library is used to create and execute the artificial neural network.

#### 1. Importing libraries

In [18]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense

#### 2. First configuration (5 x Relu)

This first configuration is a five-layers ANN. The input dimension is 57 (as there are 57 dimensions) while the output dimension is one. The intermediate layers have 29 units (using the thumb rule (57+1)/2).

The activation function is a 'relu' for each of the intermediate layer, except the output layer which has a 'sigmoid' one.

The number of iterations is arbitrarily fixed at 100 while the batch size is 10.

In [23]:
# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 29, kernel_initializer = 'uniform', activation = 'relu', input_dim = 57))

# Adding the second hidden layer
classifier.add(Dense(units = 29, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the third hidden layer
classifier.add(Dense(units = 29, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the fourth hidden layer
classifier.add(Dense(units = 29, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the fifth hidden layer
classifier.add(Dense(units = 29, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(Xtrain, Ytrain, batch_size = 10, epochs = 100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x2b083c9d6a0>

Once the model has been trained on train data, it can be assessed by using test data.

In [24]:
# Predicting the Test set results
y_pred = classifier.predict(Xtest)
y_pred = (y_pred > 0.5)

The confusion matrix as well as the accuracy is computed and printed.

In [25]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Ytest, y_pred)
print(cm)
print((cm[0][0]+cm[1][1])/(cm[0][0]+cm[0][1]+cm[1][0]+cm[1][1]))

[[528  20]
 [ 26 346]]
0.95
