## Spam Detection Project
(Natural Language Processing)

### 01. Data science project implementation


Unwanted messages, also known as spam, waste a lot of email users' time. They can also threaten information security because, in some cases, they contain malicious links that lead to malware. To reduce these risks, a system that detects spam automatically is needed.

The goal of the project is to create a spam detection system.

The dataset combines text messages collected from volunteers in a study in Singapore with spam text messages from a UK reporting site.

To complete this project, the data was explored and machine learning models were built to make the predictions. <br/>
All models were trained with PySpark using Spark's MLlib. <br/>
The data is in the file "SMSSpamCollection", a tab-separated text file, and was processed after loading for later use by the models.

In [None]:
# Import related packages
import numpy as np
from pyspark.sql.functions import length

# Feature Selection
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer
from pyspark.ml.feature import VectorAssembler

# Evaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Build the Model
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import NaiveBayes

# Start spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp').getOrCreate()

## 02. Sourcing Data

In [None]:
# 02.1 Loading data
data = spark.read.csv('/FileStore/tables/SMSSpamCollection', inferSchema=True, sep='\t') # fields are separated by tabs (\t), not commas
data = data.withColumnRenamed('_c0', 'class').withColumnRenamed('_c1', 'text')

data.show(3, truncate=False)

+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|class|text                                                                                                                                                       |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                            |
|ham  |Ok lar... Joking wif u oni...                                                                                                                              |
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's|
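+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+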

## 03. Exploratory Data Analysis and Data Cleaning

In [None]:
data = data.withColumn('length', length(data['text']))

data.show(3)
data.groupBy('class').mean().show()

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
+-----+--------------------+------+
only showing top 3 rows

+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham| 71.4545266210897|
| spam|138.6706827309237|
+-----+-----------------+



## 04. Modeling and Evaluation

In [None]:
# 04.1. Data configuration

# 04.1.1 Class index and setting up features
ham_spam_to_numeric = StringIndexer(inputCol='class', outputCol='label')
tokenizer = Tokenizer(inputCol='text', outputCol='token_text')
stop_remove = StopWordsRemover(inputCol='token_text', outputCol='stop_token')
count_vec = CountVectorizer(inputCol='stop_token', outputCol='c_vec')
idf = IDF(inputCol='c_vec', outputCol='tf_idf') # Inverse document frequency
clean_up = VectorAssembler(inputCols=['tf_idf', 'length'], outputCol='features')

data_prep_pipe = Pipeline(stages=[ham_spam_to_numeric, tokenizer, stop_remove, count_vec, idf, clean_up])
clean_data = data_prep_pipe.fit(data).transform(data)
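
To make the pipeline stages above more concrete, here is a minimal plain-Python sketch of what the tokenizer, stop-word removal, count vectorizer and IDF steps compute. The toy corpus and the four-word stop list are assumptions made for illustration; they are not taken from the SMS dataset or from Spark's default stop-word list. The IDF weight uses the smoothed formula given in the Spark MLlib documentation, log((n + 1) / (df + 1)).

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the SMS dataset
docs = [
    "Free entry to win a prize",
    "Ok see you at home",
    "Win a free prize now",
]

# Tiny illustrative stop-word list (an assumption); Spark's default list is much longer
STOP_WORDS = {"a", "to", "at", "you"}

# Tokenizer step: lowercase, then split on whitespace
tokens = [doc.lower().split() for doc in docs]

# StopWordsRemover step: drop stop words
filtered = [[w for w in doc if w not in STOP_WORDS] for doc in tokens]

# CountVectorizer step: raw term counts per document
term_counts = [Counter(doc) for doc in filtered]

# Document frequency: number of documents containing each term
doc_freq = Counter(w for doc in filtered for w in set(doc))
n_docs = len(docs)

# IDF step: smoothed inverse document frequency, log((n + 1) / (df + 1))
idf = {w: math.log((n_docs + 1) / (df + 1)) for w, df in doc_freq.items()}

# TF-IDF: term count multiplied by the IDF weight
tf_idf = [{w: c * idf[w] for w, c in counts.items()} for counts in term_counts]

for doc, weights in zip(docs, tf_idf):
    print(doc, "->", {w: round(v, 3) for w, v in weights.items()})
```

Terms that appear in several messages (such as "free" here) get a lower weight than terms that appear in only one, which is the effect the `IDF` stage has on the `c_vec` counts.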

In [None]:
# 04.1.2 Inside the Data Configuration (Step-by-step and results)

step_1 = ham_spam_to_numeric.fit(data).transform(data)
step_1.show(1)

step_2 = tokenizer.transform(step_1)
step_2.show(1)

step_3 = stop_remove.transform(step_2)
step_3.show(1) # .select('token_text', 'stop_token')

step_4 = count_vec.fit(step_3).transform(step_3)
step_4.show(1)

step_5 = idf.fit(step_4).transform(step_4)
step_5.show(1) # .select('c_vec', 'tf_idf')

step_6 = clean_up.transform(step_5)
step_6.show(1)

+-----+--------------------+------+-----+
|class|                text|length|label|
+-----+--------------------+------+-----+
|  ham|Go until jurong p...|   111|  0.0|
+-----+--------------------+------+-----+
only showing top 1 row

+-----+--------------------+------+-----+--------------------+
|class|                text|length|label|          token_text|
+-----+--------------------+------+-----+--------------------+
|  ham|Go until jurong p...|   111|  0.0|[go, until, juron...|
+-----+--------------------+------+-----+--------------------+
only showing top 1 row

+-----+--------------------+------+-----+--------------------+--------------------+
|class|                text|length|label|          token_text|          stop_token|
+-----+--------------------+------+-----+--------------------+--------------------+
|  ham|Go until jurong p...|   111|  0.0|[go, until, juron...|[go, jurong, poin...|
+-----+--------------------+------+-----+--------------------+--------------------+
only showing top 1 row
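
The `label` column above shows `ham` mapped to `0.0`. By default, `StringIndexer` orders labels by descending frequency, so the most frequent class gets index 0. The plain-Python sketch below mimics that behaviour on a hypothetical list of classes; the function name `index_labels` and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (sketch assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical class column: ham is the majority class
classes = ["ham", "ham", "spam", "ham", "spam", "ham"]

mapping = index_labels(classes)
labels = [mapping[c] for c in classes]
print(mapping)
print(labels)
```

Because ham messages outnumber spam in this kind of data, spam ends up as the positive class `1.0`.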

In [None]:
# 4.1.3 Restructuring between predictor and target attributes
data_models = clean_data.select('label', 'features')

# 4.1.4 Dividing the sample
train_data, test_data = data_models.randomSplit([0.7, 0.3])

In [None]:
# 04.2. Modeling: Logistic Regression (Test 01)

# 4.2.1. Building a Logistic Regression model object
model_log_r = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction') 
spam_detector = model_log_r.fit(train_data)

# 4.2.2. Run model
test_predictions = spam_detector.transform(test_data)
test_predictions.show(3)

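# 4.2.3. Checking the efficiency of the model
# Note: MulticlassClassificationEvaluator defaults to metricName='f1',
# so the value printed as 'ACC' is a weighted F1 score, not accuracy.
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_predictions)
print('ACC of Logistic Regression Model:', acc) # R: 0.9789

# Accuracy from the model's own evaluation summary
evaluator = spam_detector.evaluate(test_data)
evaluator.accuracy # R: 0.9794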

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,2,7,8...|[24.5166164970199...|[0.99999999997748...|       0.0|
|  0.0|(13424,[0,1,7,15,...|[22.1057883431354...|[0.99999999974905...|       0.0|
|  0.0|(13424,[0,1,9,14,...|[21.6815800738554...|[0.99999999961646...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows

ACC of Logistic Regression Model: 0.9789464169061226
Out[24]: 0.9794188861985472
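
The two scores above differ slightly (0.9789 versus 0.9794). A likely reason is that `MulticlassClassificationEvaluator` uses `metricName='f1'` by default, which is an F1 score weighted by class support, while `evaluator.accuracy` is plain accuracy. The sketch below computes both metrics in plain Python on hypothetical labels and predictions (not the notebook's data) to show that they generally do not coincide; the helper names are assumptions of this sketch.

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score

# Hypothetical example: 8 ham (0.0), 2 spam (1.0), one spam message missed
labels = [0.0] * 8 + [1.0] * 2
preds = [0.0] * 8 + [0.0, 1.0]

print("accuracy:   ", accuracy(labels, preds))
print("weighted F1:", weighted_f1(labels, preds))
```

To report accuracy directly from the evaluator, `MulticlassClassificationEvaluator(metricName='accuracy')` can be used instead of the default.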

In [None]:
# 04.3. Decision tree classifier (Test 02)

# 4.3.1. Building a Decision Tree Classifier object
model_dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label', predictionCol='prediction') 
spam_detector = model_dtc.fit(train_data)

# 4.3.2. Run model
test_predictions = spam_detector.transform(test_data)
test_predictions.show(3)

# 4.3.3. Checking the efficiency of the model
# (the evaluator's default metric is weighted F1, despite the 'ACC' label)
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_predictions)
print('ACC of Decision Tree Model:', acc) # R: 0.9355

+-----+--------------------+-------------+-----------+----------+
|label|            features|rawPrediction|probability|prediction|
+-----+--------------------+-------------+-----------+----------+
|  0.0|(13424,[0,1,2,7,8...|  [14.0,21.0]|  [0.4,0.6]|       1.0|
|  0.0|(13424,[0,1,7,15,...|  [14.0,21.0]|  [0.4,0.6]|       1.0|
|  0.0|(13424,[0,1,9,14,...|  [14.0,21.0]|  [0.4,0.6]|       1.0|
+-----+--------------------+-------------+-----------+----------+
only showing top 3 rows

ACC of Decision Tree Model: 0.9355091346302511


In [None]:
# 04.4. Support Vector Machine (Test 03)

# 4.4.1. Building a Linear SVC model object
model_svc = LinearSVC(featuresCol='features', labelCol='label', predictionCol='prediction') 
spam_detector = model_svc.fit(train_data)

# 4.4.2. Run model
test_predictions = spam_detector.transform(test_data)
test_predictions.show(3)

# 4.4.3. Checking the efficiency of the model
# (the evaluator's default metric is weighted F1, despite the 'ACC' label)
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_predictions)
print('ACC of SVC Machine Model:', acc) # R: 0.9829

# Accuracy from the model's own evaluation summary
evaluator = spam_detector.evaluate(test_data)
evaluator.accuracy # R: 0.9830

+-----+--------------------+--------------------+----------+
|label|            features|       rawPrediction|prediction|
+-----+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,2,7,8...|[1.83135196933833...|       0.0|
|  0.0|(13424,[0,1,7,15,...|[1.70028581914278...|       0.0|
|  0.0|(13424,[0,1,9,14,...|[1.86016095733911...|       0.0|
+-----+--------------------+--------------------+----------+
only showing top 3 rows

ACC of SVC Machine Model: 0.9829266416427664
Out[26]: 0.9830508474576272

In [None]:
# 04.5 Modeling: Naive Bayes (Test 04)

# 4.5.1. Building a Naive Bayes model object (Spark's default modelType is multinomial, not Gaussian)
model_nb = NaiveBayes()
spam_detector = model_nb.fit(train_data)

# 4.5.2. Run model
test_predictions = spam_detector.transform(test_data)
test_predictions.show(3)

# 4.5.3. Checking the efficiency of the model
# (the evaluator's default metric is weighted F1, despite the 'ACC' label)
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_predictions)
print('ACC of NB Model:', acc) # R: 0.9256

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,2,7,8...|[-790.42125251548...|[1.0,1.0546815041...|       0.0|
|  0.0|(13424,[0,1,7,15,...|[-661.25069354602...|[1.0,1.1702857962...|       0.0|
|  0.0|(13424,[0,1,9,14,...|[-542.13806409330...|[1.0,2.5887078920...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows

ACC of NB Model: 0.925633960821264


The best-performing models were Logistic Regression and the linear SVC, both scoring around 0.98. <br/>
Both are now retrained and evaluated 30 times to see which one performs better.

In [None]:
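# 04.9. Final performance test
# Note: every iteration refits on the same train/test split, so the scores do not
# vary between iterations; the means match the single-run values from Tests 01 and 03.

resultados_log_regressor_cv, resultados_svc_machine_cv = [], []

for i in range(30):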
    model_lr = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction') 
    spam_detector = model_lr.fit(train_data)
    acc_eval = MulticlassClassificationEvaluator()
    resultados_log_regressor_cv.append(acc_eval.evaluate(spam_detector.transform(test_data)))

    model_svc = LinearSVC(featuresCol='features', labelCol='label', predictionCol='prediction')
    spam_detector = model_svc.fit(train_data)
    acc_eval = MulticlassClassificationEvaluator()
    resultados_svc_machine_cv.append(acc_eval.evaluate(spam_detector.transform(test_data)))
    
resultados_log_regressor_cv = np.array(resultados_log_regressor_cv)
resultados_svc_machine_cv = np.array(resultados_svc_machine_cv)

resultados_log_regressor_cv.mean(), resultados_svc_machine_cv.mean()
# (0.9789464169061226, 0.9829266416427664)

Out[27]: (0.9789464169061226, 0.9829266416427664)
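
The means above are identical to the single-run scores from Tests 01 and 03, which suggests that repeating the fit on one fixed split adds no new information. A repeated evaluation is more informative when each repetition uses a different random split. The sketch below shows that idea in plain Python with a toy length-threshold classifier; the data, the function names (`repeated_holdout`, `fit_threshold`, `accuracy`) and the classifier itself are hypothetical stand-ins, not the notebook's models.

```python
import random
import statistics

def repeated_holdout(rows, fit, score, n_repeats=30, train_frac=0.7):
    """Evaluate on a fresh random train/test split for each repetition."""
    scores = []
    for seed in range(n_repeats):
        shuffled = list(rows)
        random.Random(seed).shuffle(shuffled)  # a different seed gives a different split
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(score(fit(train), test))
    return scores

# Hypothetical toy data: (message length, label), where 1 = spam
rows = [(20, 0), (35, 0), (50, 0), (60, 0), (75, 0), (90, 0), (120, 0),
        (40, 1), (110, 1), (130, 1), (150, 1), (160, 1)]

def fit_threshold(train):
    # Toy "model": midpoint between the mean ham length and the mean spam length
    ham = [length for length, label in train if label == 0]
    spam = [length for length, label in train if label == 1]
    return (statistics.mean(ham) + statistics.mean(spam)) / 2

def accuracy(threshold, test):
    # Predict spam when the message is longer than the threshold
    hits = sum((length > threshold) == (label == 1) for length, label in test)
    return hits / len(test)

scores = repeated_holdout(rows, fit_threshold, accuracy)
print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```

In PySpark, the equivalent change would be to call `data_models.randomSplit([0.7, 0.3], seed=i)` inside the loop, so that each of the 30 iterations trains and tests on a different split and the spread of the scores can be compared between models.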

## 05. Discussion and Conclusion
Of the four models tested, the linear Support Vector Machine achieved the best result, with an accuracy of about 98% (0.9830), narrowly ahead of Logistic Regression (0.9794).

Thank you for following along. If you have any suggestions or constructive criticism, I'm 100% open!

Joao Ambrosio