# Spam classification with Naive Bayes and Support Vector Machines

Notebook adapted from https://www.kaggle.com/code/pablovargas/naive-bayes-svm-spam-filtering

### METRICS INTRO



#### Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

In the table below, each row represents the instances of an actual class and each column represents the instances of a predicted class (some references transpose this layout). The name stems from the fact that the matrix makes it easy to see whether the classifier is confusing two classes (i.e. commonly mislabeling one as another). Treating spam as the positive class, the four cells are:

* True Positives (TP): We predicted spam, and the message is spam.
* True Negatives (TN): We predicted ham, and the message is ham.
* False Positives (FP): We predicted spam, but the message is actually ham. (Also known as a "Type I error.")
* False Negatives (FN): We predicted ham, but the message is actually spam. (Also known as a "Type II error.")

Table of confusion matrix terms

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |
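
As a small illustration of the table above, the four cells can be counted with `metrics.confusion_matrix` from scikit-learn. The labels here are made-up toy values (1 = spam, 0 = ham), not taken from the SMS dataset.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration: 1 = spam (positive), 0 = ham (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# labels=[1, 0] puts the positive class first, matching the table layout above:
# rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm

print(cm)
print("TP, FN, FP, TN =", tp, fn, fp, tn)
```

Without the `labels` argument scikit-learn sorts the labels, so the first row and column belong to class 0. That is the layout of the confusion matrices printed later in this notebook.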


#### Precision

Precision is the fraction of relevant instances among the retrieved instances. It is defined as the number of true positives divided by the sum of true positives and false positives.

$$Precision = \frac{TP}{TP + FP}$$

#### Recall

Recall is the fraction of relevant instances that have been retrieved over the total amount of relevant instances. It is defined as the number of true positives divided by the sum of true positives and false negatives.

$$Recall = \frac{TP}{TP + FN}$$


Graphically, the confusion matrix, precision and recall can be visualized as follows:

![Confusion Matrix](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png)
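
The two formulas can be checked by hand on a few toy predictions. This is a minimal sketch with invented labels (1 = spam, 0 = ham); it compares the manual counts with `precision_score` and `recall_score` from scikit-learn.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for illustration: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"precision = {precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"recall    = {recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
```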

#### F1 Score

The F1 score is the harmonic mean of precision and recall. It is defined as the number of true positives divided by the sum of true positives and half the sum of false positives and false negatives.

The exact equation is:

$$F1 = 2 \frac{precision \times recall}{precision + recall}$$
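
As a short check that the verbal definition and the equation agree, substitute $P = \frac{TP}{TP + FP}$ and $R = \frac{TP}{TP + FN}$ (shorthand introduced here for precision and recall) into the harmonic mean:

$$F1 = 2 \frac{P \times R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{TP}{TP + \frac{1}{2}(FP + FN)}$$

For instance, the SVM confusion matrix reported later in this notebook has $TP = 124$, $FP = 1$ and $FN = 26$, which gives $F1 = 124 / 137.5 \approx 0.9018$, in line with the F1 score listed for that model.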

#### Accuracy

Accuracy is the fraction of predictions our model got right. It is defined as the number of true positives and true negatives divided by the total number of predictions.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
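
Accuracy alone can be misleading when the classes are imbalanced, which is why this notebook also tracks recall, precision and F1. The sketch below evaluates a hypothetical classifier that labels every message as ham, using the test-set class counts that appear in the confusion matrices later in this notebook (965 ham, 150 spam).

```python
# Test-set class counts, read off the confusion matrices later in this notebook.
n_ham, n_spam = 965, 150

# A hypothetical "always predict ham" classifier: it never flags anything as spam.
tp, fn = 0, n_spam
tn, fp = n_ham, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Such a classifier is right for every ham message and therefore scores a high accuracy, yet it catches no spam at all.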

## Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline  

## Exploring the Dataset

In [2]:
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head(n=10)

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | NaN | NaN | NaN |
| 6 | ham | Even my brother is not like to speak with me. ... | NaN | NaN | NaN |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | NaN | NaN | NaN |
| 8 | spam | WINNER!! As a valued network customer you have... | NaN | NaN | NaN |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | NaN | NaN | NaN |


## Feature engineering

Text preprocessing, tokenizing and stop-word filtering are bundled in a high-level component, `CountVectorizer`, which builds a dictionary of features and transforms documents into feature vectors.

**We remove stop words because very common words carry little signal for telling spam from ham.**

In [3]:
f = feature_extraction.text.CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])
np.shape(X)

(5572, 8404)

We have created 8404 new features, one per distinct word left after stop-word removal. Feature $j$ in row $i$ is the number of times the word $w_{j}$ appears in text example $i$; it is zero if the word does not appear.
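
To make the feature matrix concrete, here is a minimal sketch on three invented messages (not from the dataset). Note that `CountVectorizer` stores word counts by default, so a word repeated within one message yields a value above 1, and stop words such as "at" never become features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented messages, for illustration only.
docs = [
    "win free cash free prize",
    "lunch meeting at noon",
    "free lunch",
]

vec = CountVectorizer(stop_words="english")
X_toy = vec.fit_transform(docs)

# vocabulary_ maps each kept word to its column index in the matrix.
free_col = vec.vocabulary_["free"]
print(sorted(vec.vocabulary_))
print(X_toy.toarray())
print("count of 'free' in message 0:", X_toy[0, free_col])
```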

## Predictive Analysis

First we map the spam/ham label to a binary variable (spam = 1, ham = 0), then we split the data set into a training set (80%) and a test set (20%).

In [4]:
data["v1"]=data["v1"].map({'spam':1,'ham':0})
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, data['v1'], test_size=0.2, random_state=42)
print([np.shape(X_train), np.shape(X_test)])

[(4457, 8404), (1115, 8404)]


### Multinomial naive bayes classifier

We train several multinomial naive Bayes models, varying the smoothing parameter $\alpha$.

We evaluate the accuracy, recall, precision and F1 score of each model on the test set.
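
For reference, $\alpha$ in `MultinomialNB` is an additive smoothing term: the probability of word $j$ in class $c$ is estimated as $\frac{N_{cj} + \alpha}{N_c + \alpha n}$, where $N_{cj}$ is the count of word $j$ in the training messages of class $c$, $N_c$ is the total word count of that class and $n$ is the number of features. The sketch below checks this on a tiny invented count matrix; the names `X_toy`, `y_toy` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented word counts: 4 messages, 3 words; labels 1 = spam, 0 = ham.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 1],
                  [0, 3, 1],
                  [0, 2, 2]])
y_toy = np.array([1, 1, 0, 0])

alpha = 0.5
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Smoothed estimate for the spam class, computed by hand.
counts = X_toy[y_toy == 1].sum(axis=0)              # N_cj for c = spam
manual = (counts + alpha) / (counts.sum() + alpha * X_toy.shape[1])

# classes_ is sorted as [0, 1], so row 1 of feature_log_prob_ is the spam class.
fitted = np.exp(nb.feature_log_prob_[1])
print(manual)
print(fitted)
```

A larger $\alpha$ pulls every word probability toward the uniform value $1/n$, which is why it acts like a regularizer in the experiment below.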

In [5]:
list_alpha = np.arange(1/100000, 20, 0.11)
score_train = np.zeros(len(list_alpha))
score_test = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test = np.zeros(len(list_alpha))
f1_test = np.zeros(len(list_alpha))
for count, alpha in enumerate(list_alpha):
    bayes = naive_bayes.MultinomialNB(alpha=alpha)
    bayes.fit(X_train, y_train)
    y_pred = bayes.predict(X_test)  # predict once and reuse for every metric
    score_train[count] = bayes.score(X_train, y_train)
    score_test[count] = bayes.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, y_pred)
    precision_test[count] = metrics.precision_score(y_test, y_pred)
    f1_test[count] = metrics.f1_score(y_test, y_pred)

Let's see the first 10 learning models and their metrics!

In [6]:
matrix = np.c_[list_alpha, score_train, score_test, recall_test, precision_test, f1_test]
models = pd.DataFrame(data = matrix, columns = 
             ['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision', 'Test F1 Score'])
models.head(n=10)

| | alpha | Train Accuracy | Test Accuracy | Test Recall | Test Precision | Test F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1e-05 | 0.998205 | 0.974888 | 0.92 | 0.896104 | 0.907895 |
| 1 | 0.11001 | 0.997083 | 0.974888 | 0.94 | 0.88125 | 0.909677 |
| 2 | 0.22001 | 0.997083 | 0.975785 | 0.933333 | 0.89172 | 0.912052 |
| 3 | 0.33001 | 0.996186 | 0.973991 | 0.933333 | 0.880503 | 0.906149 |
| 4 | 0.44001 | 0.995961 | 0.976682 | 0.94 | 0.892405 | 0.915584 |
| 5 | 0.55001 | 0.995513 | 0.974888 | 0.926667 | 0.891026 | 0.908497 |
| 6 | 0.66001 | 0.995513 | 0.978475 | 0.926667 | 0.914474 | 0.92053 |
| 7 | 0.77001 | 0.995288 | 0.979372 | 0.926667 | 0.92053 | 0.923588 |
| 8 | 0.88001 | 0.995064 | 0.979372 | 0.926667 | 0.92053 | 0.923588 |
| 9 | 0.99001 | 0.995064 | 0.979372 | 0.926667 | 0.92053 | 0.923588 |


We select the model with the highest F1 score on the test set.

In [7]:
best_index = models['Test F1 Score'].idxmax()
models.iloc[best_index, :]

alpha             3.080010
Train Accuracy    0.991698
Test Accuracy     0.983857
Test Recall       0.913333
Test Precision    0.964789
Test F1 Score     0.938356
Name: 28, dtype: float64
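
Picking $\alpha$ by its test-set score lets the test set influence the model choice, so the test metrics of the selected model tend to be optimistic. A common alternative is to tune on the training set with cross-validation and touch the test set only once. The sketch below shows the idea with `GridSearchCV` on synthetic count data; the data, the grid of `alpha` values and all variable names are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic word counts standing in for the document-term matrix (assumption):
# class 1 uses word 0 more often, class 0 uses word 2 more often.
y_toy = rng.integers(0, 2, size=400)
rates = np.where(y_toy[:, None] == 1,
                 [3.0, 1.0, 1.0, 0.5],
                 [1.0, 1.0, 3.0, 0.5])
X_toy = rng.poisson(rates)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training split only, scored by F1.
alphas = [0.01, 0.1, 1.0, 3.0, 10.0]
search = GridSearchCV(MultinomialNB(), {"alpha": alphas}, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

# The held-out split is used once, after alpha has been chosen.
test_f1 = search.score(X_te, y_te)
print(search.best_params_, round(test_f1, 3))
```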

#### Confusion matrix with naive bayes classifier

In [8]:
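# After the loop, `bayes` holds the last alpha tried, so refit with the selected alpha.
bayes = naive_bayes.MultinomialNB(alpha=models.loc[best_index, 'alpha'])
bayes.fit(X_train, y_train)
m_confusion_test = metrics.confusion_matrix(y_test, bayes.predict(X_test))
pd.DataFrame(data = m_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])

| | Predicted 0 | Predicted 1 |
| --- | --- | --- |
| Actual 0 | 960 | 5 |
| Actual 1 | 13 | 137 |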


## Support Vector Machine

### Quick introduction to SVM

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

* Effective in high dimensional spaces.
* Still effective in cases where number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

* If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

#### Support vectors

The main idea behind support vector machines is to find a hyperplane that best separates the classes. The hyperplane is defined by the equation:

$$w^Tx + b = 0$$

where $w$ is the normal vector to the hyperplane and $b$ is the bias. The hyperplane is the decision boundary between the classes (a line in two dimensions). The support vectors are the data points closest to the hyperplane, and the distance between the hyperplane and the support vectors is called the margin. Training looks for the hyperplane that maximizes the margin.

![Support vectors visualization](https://static.wixstatic.com/media/8f929f_7ecacdcf69d2450087cb4a898ef90837~mv2.png)
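
The quantities above can be read off a fitted linear SVM. This is a minimal sketch on four invented 2-D points that are separable by the vertical line $x_1 = 1$; a very large `C` is used to approximate a hard margin. With the usual scaling, in which the support vectors satisfy $|w^Tx + b| = 1$, the distance from the hyperplane to the support vectors is $1 / \|w\|$.

```python
import numpy as np
from sklearn.svm import SVC

# Four invented points: class 0 on the line x1 = 0, class 1 on the line x1 = 2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A large C leaves almost no room for margin violations (near hard margin).
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]       # normal vector of the hyperplane
b = clf.intercept_[0]  # bias
margin = 1 / np.linalg.norm(w)  # distance from hyperplane to support vectors

print("w =", w, "b =", b)
print("margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```

For these points the separating line sits halfway between the two groups, so the margin comes out close to 1.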


#### Kernel functions

In case of linearly inseparable data, we can use a kernel function to implicitly map the data into a higher dimensional space, where it may become linearly separable. The kernel function computes the dot product of two vectors in that higher dimensional space without computing the mapping explicitly. It is defined as:

$$K(x, y) = \phi(x)^T \phi(y)$$

where $\phi(x)$ is a mapping from the original space to a higher dimensional space.
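
As a concrete instance of this identity, take 2-D inputs and the homogeneous degree-2 polynomial kernel $K(x, y) = (x^T y)^2$. One explicit mapping that reproduces it is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sketch below, with arbitrarily chosen vectors, checks that both routes give the same number.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Two arbitrary example vectors.
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = phi(x) @ phi(y)  # dot product in the 3-D feature space
implicit = (x @ y) ** 2     # kernel evaluated in the original 2-D space

print(explicit, implicit)
```

The kernel route never builds the 3-D vectors, which is what makes kernels practical when the feature space is very large.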

The most common kernel functions are:

* Linear kernel: $K(x, y) = x^T y$
* Polynomial kernel: $K(x, y) = (x^T y + c)^d$
* Radial basis function (RBF) kernel: $K(x, y) = \exp(-\gamma \|x - y\|^2)$
* Sigmoid kernel: $K(x, y) = \tanh(\gamma x^T y + r)$


![Kernel](https://miro.medium.com/max/838/1*gXvhD4IomaC9Jb37tzDUVg.png) 
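
The four kernels listed above can be written in a few lines of NumPy. The sketch below is illustrative only: the function names and default parameter values are chosen here, and the RBF kernel uses the squared Euclidean distance, which is the form scikit-learn implements. Each function returns the matrix of kernel values between the rows of `X` and the rows of `Y`, and three of them are compared with the reference implementations in `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

def linear(X, Y):
    return X @ Y.T

def polynomial(X, Y, c=1.0, d=3):
    return (X @ Y.T + c) ** d

def rbf(X, Y, gamma=0.5):
    # Squared distances via ||x||^2 + ||y||^2 - 2 x^T y.
    sq_dist = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2 * (X @ Y.T)
    return np.exp(-gamma * sq_dist)

def sigmoid(X, Y, gamma=0.5, r=0.0):
    return np.tanh(gamma * (X @ Y.T) + r)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(5, 3))

# scikit-learn's polynomial kernel has an extra gamma factor; gamma=1 matches the formula above.
poly_ok = np.allclose(polynomial(A, B), polynomial_kernel(A, B, degree=3, gamma=1.0, coef0=1.0))
rbf_ok = np.allclose(rbf(A, B), rbf_kernel(A, B, gamma=0.5))
sig_ok = np.allclose(sigmoid(A, B), sigmoid_kernel(A, B, gamma=0.5, coef0=0.0))
print(poly_ok, rbf_ok, sig_ok)
```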


### Training the SVM model

We train several models, varying the regularization parameter C.

We evaluate the accuracy, recall, precision and F1 score of each model on the test set.

In [9]:
list_C = np.arange(500, 2000, 100)
score_train = np.zeros(len(list_C))
score_test = np.zeros(len(list_C))
recall_test = np.zeros(len(list_C))
precision_test = np.zeros(len(list_C))
f1_test = np.zeros(len(list_C))
for count, C in enumerate(list_C):
    svc = svm.SVC(C=C, kernel='rbf')
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)  # predict once and reuse for every metric
    score_train[count] = svc.score(X_train, y_train)
    score_test[count] = svc.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, y_pred)
    precision_test[count] = metrics.precision_score(y_test, y_pred)
    f1_test[count] = metrics.f1_score(y_test, y_pred)

Let's see the first 10 learning models and their metrics!

In [10]:
matrix = np.c_[list_C, score_train, score_test, recall_test, precision_test, f1_test]
models = pd.DataFrame(data = matrix, columns = 
             ['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision', 'Test F1 Score'])
models.head(n=10)

| | C | Train Accuracy | Test Accuracy | Test Recall | Test Precision | Test F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 500.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 1 | 600.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 2 | 700.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 3 | 800.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 4 | 900.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 5 | 1000.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 6 | 1100.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 7 | 1200.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 8 | 1300.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |
| 9 | 1400.0 | 1.0 | 0.975785 | 0.826667 | 0.992 | 0.901818 |


We select the model with the highest F1 score on the test set.

In [11]:
best_index = models['Test F1 Score'].idxmax()
models.iloc[best_index, :]

C                 500.000000
Train Accuracy      1.000000
Test Accuracy       0.975785
Test Recall         0.826667
Test Precision      0.992000
Test F1 Score       0.901818
Name: 0, dtype: float64

#### Confusion matrix with support vector machine classifier

In [12]:
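# After the loop, `svc` holds the last C tried, so refit with the selected C.
svc = svm.SVC(C=models.loc[best_index, 'C'], kernel='rbf')
svc.fit(X_train, y_train)
m_confusion_test = metrics.confusion_matrix(y_test, svc.predict(X_test))
pd.DataFrame(data = m_confusion_test, columns = ['Predicted 0', 'Predicted 1'],
            index = ['Actual 0', 'Actual 1'])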

| | Predicted 0 | Predicted 1 |
| --- | --- | --- |
| Actual 0 | 964 | 1 |
| Actual 1 | 26 | 124 |


## Use custom kernel

In [13]:
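# A custom kernel is a callable that returns the matrix of kernel values
# between the rows of X and the rows of Y; this one reproduces the linear kernel.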
def my_kernel(X, Y):
    return np.dot(X, Y.T)

svc = svm.SVC(kernel=my_kernel)
svc.fit(X_train, y_train)
print('Test accuracy:', svc.score(X_test, y_test))
print('Test F1 score:', metrics.f1_score(y_test, svc.predict(X_test)))

Test accuracy: 0.9802690582959641
Test F1 score: 0.9214285714285715
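
Because `my_kernel` computes $x^T y$, it should behave like the built-in `kernel='linear'` option. The sketch below compares the two on synthetic dense data; the data and variable names are invented for illustration, and both models use the default `C`.

```python
import numpy as np
from sklearn.svm import SVC

def my_kernel(X, Y):
    # Same definition as above: the Gram matrix of plain dot products.
    return np.dot(X, Y.T)

# Synthetic, roughly linearly separable data (assumption, not the SMS dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

custom = SVC(kernel=my_kernel).fit(X_toy, y_toy)
builtin = SVC(kernel="linear").fit(X_toy, y_toy)

# Fraction of points on which the two models predict the same label.
agreement = np.mean(custom.predict(X_toy) == builtin.predict(X_toy))
print("agreement:", agreement)
```

A callable kernel becomes useful once it encodes something the built-in kernels cannot, for example a similarity measure tailored to text.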
