# Spam detection using ML

Hao Meng 2020.7

Rewrite in the notebook in 2021.10


## Introduction

In my previous notebook, I analyzed how to detect malware using Supervised learning of machine learning. In this notebook, I will discuss how to detect spam emails using machine learning. 

Detection of email is different from malware as email only has some words. 

In research from Graham (2002), he got a high accuracy of spam email detection by Naive Bayesian filter. I will briefly explain the principle of the Naive Bayesian filter. 

**1. Bayesian Inference**

In general, it is conditional probability. For example. In total, the probability of rain in tomorrow is 0.3. However, if many clouds appear in today's evening, the probability of rain in tomorrow will increase to 0.5. The condition is that colouds appear in the evening of previous day.

In spam email detection, we assume the probability of the spam email is 0.5. However, if "sex" appears in email, the probability of the spam email will increase to 0.8. We can statstic the probability of some key words similar to "sex" from all samples emails.

**2. Naive Bayesian**

It can't just judge whether it is a spam email by one word. There are more keywords like "Gamble," "smoke," and so on. If every word's conditional probability is independent, the Bayesian Inference method will be transferred to the Naive Bayesian Inference method. And the function of the combine probabilities is:

![Combine Probabilities](https://chart.googleapis.com/chart?cht=tx&chl=P%3D%5Cfrac%7BP_%7B1%7DP_%7B2%7D%5Ccdot%20%5Ccdot%20%5Ccdot%20P_%7B15%7D%7D%7BP_%7B1%7DP_%7B2%7D%5Ccdot%20%5Ccdot%20%5Ccdot%20P_%7B15%7D%2B(1-P_%7B1%7D)(1-P_%7B2%7D)%5Ccdot%20%5Ccdot%20%5Ccdot%20(1-P_%7B15%7D)%7D&chs=70)

The P with a subscript represents one word's condition probability, and in the function, there are 15 words to calc the spam probability together. Later more words can be added to the function.

### Dataset

This notebook will use Ling-Spam Dataset provided by Ion Androutsopoulos(https://www.kaggle.com/mandygu/lingspam-dataset?rvi=1).


### Reference Materials

In this notebook, I will use the feature engineering method mainly from the notebook(https://www.kaggle.com/surekharamireddy/spam-detection-with-99-accuracy).

### Flow

1. Feature engineering
2. Vectorlize
3. Analysing by some models
4. Analysing by Naive Bayesian model


**Reference**
Graham, P. (2002, August). A Plan for Spam. Paul Graham. http://www.paulgraham.com/spam.html

In [17]:
# initial
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, \
    ExtraTreesClassifier
from collections import Counter
import string
import warnings

warnings.filterwarnings('ignore')


BASE_DIR = "/media/xueshan/WD_BLACK/cybersecurity/spam"
DATA_DIR = os.path.join(BASE_DIR, "lingspam-dataset")


## 1. Feature engineering

In [28]:
db_file = os.path.join(DATA_DIR, "messages.csv")
df=pd.read_csv(db_file)
df.head()

Unnamed: 0,subject,message,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0


In [29]:
df['message'] = df['message'].str.lower()

In [30]:
# check null data
df.isnull().sum()

subject    62
message     0
label       0
dtype: int64

In [31]:
df.fillna(" ", inplace=True)
df.isnull().sum()

subject    0
message    0
label      0
dtype: int64

In [32]:
# combine subject and message
df['integration_msg']=df['subject']+df['message']
df.drop('subject',axis=1,inplace=True)

df['label'].value_counts()


0    2412
1     481
Name: label, dtype: int64

## 2. Vectorlize

This phase is crucial. Text data is different from other data like integer, float, which ML models can't use. It has to be transferred into a vector. 

The target vector should be based on the word frequent. For example, the number of times sex appears in an email is a usable feature. But first, the email should be processed to be more readable by the machine.

In [33]:
import re

def decontact(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

df['integration_msg']=df['integration_msg'].apply(decontact)
#the number is useless
df['integration_msg']=df['integration_msg'].str.replace(r'\d+(\.\d+)?', 'numbers')
#Converting message to lowercase
df['integration_msg']=df['integration_msg'].str.lower()
# replacing line break with ' '
df['integration_msg']=df['integration_msg'].str.replace(r'\n'," ") 
# replacing email 
# df['integration_msg']=df['integration_msg'].str.replace(r'^.+@[^\.].*\.[a-z]{2,}$','MailID')
# replacing urls 
# df['integration_msg']=df['integration_msg'].str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$','Links')
# replacing currency signs by 'Money', we don't care about the number
df['integration_msg']=df['integration_msg'].str.replace(r'£|\$', 'Money')
# replacing large white space by single white space
df['integration_msg']=df['integration_msg'].str.replace(r'\s+', ' ')

# replacing leading and trailing white space by single white space
df['integration_msg']=df['integration_msg'].str.replace(r'^\s+|\s+?$', '')
# replacing contact numbers 
# df['integration_msg']=df['integration_msg'].str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$','contact number')
# replacing special characters by ' '
df['integration_msg']=df['integration_msg'].str.replace(r"[^a-zA-Z0-9]+", " ")

In [34]:
df['integration_msg'][1]

'lang classification grimes joseph e and barbara f grimes ethnologue language family index pb isbn numbers numbers numbers numbers vi numbers pp Money numbers numbers summer institute of linguistics this companion volume to ethnologue languages of the world twelfth edition lists language families of the world with sub groups shown in a tree arrangement under the broadest classification of language family the language family index facilitates locating language names in the ethnologue making the data there more accessible internet academic books sil org languages reference lang culture gregerson marilyn ritual belief and kinship in sulawesi pb isbn numbers numbers numbers numbers ix numbers pp Money numbers numbers summer institute of linguistics seven articles discuss five language groups in sulawesi indonesia the primary focus is on cultural matters with some linguistic content topics include traditional religion and beliefs certain ceremonies and kinship internet academic books sil or

Now we have to remove stop words which has no contribution to detection.

In [36]:
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
# removing stopwords 
stop = stopwords.words('english')
df['concise_text'] = df['integration_msg'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [37]:
df.drop('message',axis=1,inplace=True)
df.drop('integration_msg',axis=1,inplace=True)

In [38]:
df.head()

Unnamed: 0,label,concise_text
0,0,job posting apple iss research centercontent l...
1,0,lang classification grimes joseph e barbara f ...
2,0,query letter frequencies text identificationi ...
3,0,riska colleague researching differing degrees ...
4,0,request book informationearlier morning phone ...


In [44]:
df.shape

(2893, 2)

In [42]:
tvec = TfidfVectorizer()
df_vec = tvec.fit_transform(df.concise_text)

In [46]:
df_vec

<2893x56901 sparse matrix of type '<class 'numpy.float64'>'
	with 508795 stored elements in Compressed Sparse Row format>

After vectorlization, the number of features has increased to 56901.

## 3. Analysing by some models

- Logistic
- Ada Boost
- Gradient Boosting
- ExtraTree

In [72]:
from sklearn.metrics import confusion_matrix

X = df_vec
Y = df.label

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 225,stratify=Y)

def output_test_result(model):
    y_pred = model.predict(X_test)
    confusion_matrix(y_pred,Y_test)

    print("Accuracy : ", accuracy_score(y_pred,Y_test))
    print("Precision : ", precision_score(y_pred,Y_test, average = 'weighted'))
    print("Recall : ", recall_score(y_pred,Y_test, average = 'weighted'))

### Logistic Regression


In [73]:
lr = LogisticRegression(class_weight='balanced', n_jobs=14)
lr.fit(X_train, Y_train)

LogisticRegression(class_weight='balanced', n_jobs=14)

In [74]:
output_test_result(lr)

Accuracy :  0.9913644214162349
Precision :  0.9915695834540169
Recall :  0.9913644214162349


### Ada Boost Classifier

In [75]:
abc = AdaBoostClassifier(n_estimators=500)
abc.fit(X_train, Y_train)

AdaBoostClassifier(n_estimators=500)

In [76]:
output_test_result(abc)

Accuracy :  0.9896373056994818
Precision :  0.9897669287734618
Recall :  0.9896373056994818


### Gradient Boosting Classifier

In [77]:
gbc = GradientBoostingClassifier(n_estimators=500)
gbc.fit(X_train, Y_train)

GradientBoostingClassifier(n_estimators=500)

In [78]:
output_test_result(gbc)

Accuracy :  0.9792746113989638
Precision :  0.9812178400683695
Recall :  0.9792746113989638


### ExtraTrees Classifier

In [79]:
etc = ExtraTreesClassifier(n_estimators=500, max_features=None, min_samples_leaf=1,
                                                      min_samples_split=9, n_jobs=14,
                                                      class_weight="balanced",
                                                      criterion='gini')
etc.fit(X_train, Y_train)         

ExtraTreesClassifier(class_weight='balanced', max_features=None,
                     min_samples_split=9, n_estimators=500, n_jobs=14)

In [80]:
output_test_result(etc)

Accuracy :  0.9913644214162349
Precision :  0.9913391672656148
Recall :  0.9913644214162349


## 4. Analysing by Naive Bayesian model

There is a trap in the Naive Bayesian model. The naive Bayesian model can't accept the vector in fd-idf, which only can accept the vector in simple counter formal.

In [87]:
tf = CountVectorizer()
X = tf.fit_transform(df.concise_text)
X

<2893x56901 sparse matrix of type '<class 'numpy.int64'>'
	with 508795 stored elements in Compressed Sparse Row format>

In [91]:
Y = df.label

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 225,stratify=Y)

In [92]:
mnb = MultinomialNB(fit_prior=True)
mnb.fit(X_train, Y_train)

MultinomialNB()

In [93]:
output_test_result(mnb)

Accuracy :  0.9896373056994818
Precision :  0.9896368587233648
Recall :  0.9896373056994818
