# Project

## Spam detection

Design an SMS spam detector using Spark's text-processing feature transformers (`pyspark.ml.feature`) and a Naive Bayes classifier.

## Dataset 

UCI Machine Learning Repository, SMS Spam Collection: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

The dataset is a single collection of 5,574 English SMS messages, each tagged as ham (legitimate) or spam.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlpProject').getOrCreate()

In [2]:
# The file is tab-separated with no header row, so Spark names the columns _c0 and _c1
data = spark.read.csv("smsspamcollection/SMSSpamCollection", inferSchema=True, sep='\t')

In [3]:
data.show()

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
| ham|I'm gonna be home...|
|spam|SIX chances to wi...|
|spam|URGENT! You have ...|
| ham|I've been searchi...|
| ham|I HAVE A DATE ON ...|
|spam|XXXMobileMovieClu...|
| ham|Oh k...i'm watchi...|
| ham|Eh u remember how...|
| ham|Fine if thats th...|
|spam|England v Macedon...|
+----+--------------------+
only showing top 20 rows



In [4]:
df = data.withColumnRenamed('_c0', 'tag').withColumnRenamed('_c1', 'message')
df.show()

+----+--------------------+
| tag|             message|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
| ham|I'm gonna be home...|
|spam|SIX chances to wi...|
|spam|URGENT! You have ...|
| ham|I've been searchi...|
| ham|I HAVE A DATE ON ...|
|spam|XXXMobileMovieClu...|
| ham|Oh k...i'm watchi...|
| ham|Eh u remember how...|
| ham|Fine if thats th...|
|spam|England v Macedon...|
+----+--------------------+
only showing top 20 rows



In [5]:
# Tokenize the message column: with gaps=False the pattern matches the tokens themselves
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(inputCol="message", outputCol="words", pattern="\\w+", gaps=False)
df_tokenized = regexTokenizer.transform(df)
df_tokenized.show()

+----+--------------------+--------------------+
| tag|             message|               words|
+----+--------------------+--------------------+
| ham|Go until jurong p...|[go, until, juron...|
| ham|Ok lar... Joking ...|[ok, lar, joking,...|
|spam|Free entry in 2 a...|[free, entry, in,...|
| ham|U dun say so earl...|[u, dun, say, so,...|
| ham|Nah I don't think...|[nah, i, don, t, ...|
|spam|FreeMsg Hey there...|[freemsg, hey, th...|
| ham|Even my brother i...|[even, my, brothe...|
| ham|As per your reque...|[as, per, your, r...|
|spam|WINNER!! As a val...|[winner, as, a, v...|
|spam|Had your mobile 1...|[had, your, mobil...|
| ham|I'm gonna be home...|[i, m, gonna, be,...|
|spam|SIX chances to wi...|[six, chances, to...|
|spam|URGENT! You have ...|[urgent, you, hav...|
| ham|I've been searchi...|[i, ve, been, sea...|
| ham|I HAVE A DATE ON ...|[i, have, a, date...|
|spam|XXXMobileMovieClu...|[xxxmobilemoviecl...|
| ham|Oh k...i'm watchi...|[oh, k, i, m, wat...|
| ham|Eh u remember 
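
The table shows contractions being split ("don't" becomes `don`, `t`). A minimal pure-Python sketch of why, assuming the tokenizer lowercases the text and then keeps every run of word characters; the helper name `regex_tokenize` is hypothetical and this only approximates `RegexTokenizer`:

```python
import re

def regex_tokenize(text):
    # Approximates RegexTokenizer(pattern="\\w+", gaps=False) with its default lowercasing.
    # re.ASCII restricts \w to [A-Za-z0-9_], which is how Java regexes treat \w by default.
    return re.findall(r"\w+", text.lower(), flags=re.ASCII)

# The apostrophe is not a word character, so it splits "don't" into two tokens.
print(regex_tokenize("Nah I don't think"))  # ['nah', 'i', 'don', 't', 'think']
```

Punctuation such as `...` and `!!` is dropped for the same reason: it never matches `\w+`.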

In [6]:
# Remove stop words from the tokenized 'words' column
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol='words', outputCol='removed')
df_tokenized_stw = remover.transform(df_tokenized)
df_tokenized_stw.show()

+----+--------------------+--------------------+--------------------+
| tag|             message|               words|             removed|
+----+--------------------+--------------------+--------------------+
| ham|Go until jurong p...|[go, until, juron...|[go, jurong, poin...|
| ham|Ok lar... Joking ...|[ok, lar, joking,...|[ok, lar, joking,...|
|spam|Free entry in 2 a...|[free, entry, in,...|[free, entry, 2, ...|
| ham|U dun say so earl...|[u, dun, say, so,...|[u, dun, say, ear...|
| ham|Nah I don't think...|[nah, i, don, t, ...|[nah, think, goes...|
|spam|FreeMsg Hey there...|[freemsg, hey, th...|[freemsg, hey, da...|
| ham|Even my brother i...|[even, my, brothe...|[even, brother, l...|
| ham|As per your reque...|[as, per, your, r...|[per, request, me...|
|spam|WINNER!! As a val...|[winner, as, a, v...|[winner, valued, ...|
|spam|Had your mobile 1...|[had, your, mobil...|[mobile, 11, mont...|
| ham|I'm gonna be home...|[i, m, gonna, be,...|[m, gonna, home, ...|
|spam|SIX chances to
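
Stop-word removal is a plain membership filter over the token list. The sketch below illustrates it with a small hand-picked stop list; `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the list is only an assumption for illustration, not the English list Spark ships:

```python
# Hypothetical mini stop list for illustration only; Spark's default English list is much longer.
STOP_WORDS = {"i", "don", "t", "be", "until", "in", "so", "my", "as", "your", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Case-insensitive comparison: keep every token that is not in the stop list.
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stop_words(["i", "m", "gonna", "be", "home"]))  # ['m', 'gonna', 'home']
```

A stray `m` survives, as in the table above, because it is a leftover from splitting "I'm" and is not in the stop list.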

In [7]:
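# Build TF-IDF feature vectors
from pyspark.ml.feature import HashingTF, IDF
# Step 1: term frequencies (TF) via feature hashing.
# Note: inputCol='words' hashes the unfiltered tokens, so the stop-word-filtered
# 'removed' column from the previous cell is not used; pass inputCol='removed' to use it.
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')
featurized_data = hashing_tf.transform(df_tokenized_stw)

# Step 2: rescale the term frequencies by inverse document frequency (IDF)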
idf = IDF(inputCol='rawFeatures', outputCol='features')
idf_model = idf.fit(featurized_data)
rescaled_data = idf_model.transform(featurized_data)
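
To make the two steps concrete, here is a small pure-Python sketch of hashed term frequencies followed by smoothed IDF weighting, `log((m + 1) / (df + 1))` with `m` the number of documents. The function names are hypothetical, and `zlib.crc32` is only a stand-in hash (Spark uses MurmurHash3, so the bucket indices will differ):

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size that appears in the rawFeatures column

def hashed_tf(tokens, num_features=NUM_FEATURES):
    # Map each token to a bucket index and count occurrences per bucket.
    # zlib.crc32 is an illustrative stand-in, not the hash Spark uses.
    return Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)

def idf_weights(tf_docs):
    # Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1)).
    m = len(tf_docs)
    doc_freq = Counter(idx for tf in tf_docs for idx in tf)
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}

def tf_idf(tf, idf):
    # Scale each term count by the IDF weight of its bucket.
    return {idx: count * idf[idx] for idx, count in tf.items()}

docs = [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]]
tf_docs = [hashed_tf(d) for d in docs]
idf = idf_weights(tf_docs)
vectors = [tf_idf(tf, idf) for tf in tf_docs]
```

Buckets that occur in many messages get a weight close to zero, while rare ones are amplified. Hashing avoids building a vocabulary, at the cost of occasional collisions between unrelated tokens.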

In [8]:
rescaled_data.show()

+----+--------------------+--------------------+--------------------+--------------------+--------------------+
| tag|             message|               words|             removed|         rawFeatures|            features|
+----+--------------------+--------------------+--------------------+--------------------+--------------------+
| ham|Go until jurong p...|[go, until, juron...|[go, jurong, poin...|(262144,[38555,52...|(262144,[38555,52...|
| ham|Ok lar... Joking ...|[ok, lar, joking,...|[ok, lar, joking,...|(262144,[16877,51...|(262144,[16877,51...|
|spam|Free entry in 2 a...|[free, entry, in,...|[free, entry, 2, ...|(262144,[12250,12...|(262144,[12250,12...|
| ham|U dun say so earl...|[u, dun, say, so,...|[u, dun, say, ear...|(262144,[2306,517...|(262144,[2306,517...|
| ham|Nah I don't think...|[nah, i, don, t, ...|[nah, think, goes...|(262144,[19036,25...|(262144,[19036,25...|
|spam|FreeMsg Hey there...|[freemsg, hey, th...|[freemsg, hey, da...|(262144,[19036,19...|(262144,[19036

## Create a binary index label column

In [9]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="tag", outputCol="label")
rescaled_data = indexer.fit(rescaled_data).transform(rescaled_data)

rescaled_data.show()

+----+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
| tag|             message|               words|             removed|         rawFeatures|            features|label|
+----+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
| ham|Go until jurong p...|[go, until, juron...|[go, jurong, poin...|(262144,[38555,52...|(262144,[38555,52...|  0.0|
| ham|Ok lar... Joking ...|[ok, lar, joking,...|[ok, lar, joking,...|(262144,[16877,51...|(262144,[16877,51...|  0.0|
|spam|Free entry in 2 a...|[free, entry, in,...|[free, entry, 2, ...|(262144,[12250,12...|(262144,[12250,12...|  1.0|
| ham|U dun say so earl...|[u, dun, say, so,...|[u, dun, say, ear...|(262144,[2306,517...|(262144,[2306,517...|  0.0|
| ham|Nah I don't think...|[nah, i, don, t, ...|[nah, think, goes...|(262144,[19036,25...|(262144,[19036,25...|  0.0|
|spam|FreeMsg Hey there...|[freemsg, hey, th...|[freemsg
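
`ham` maps to 0.0 and `spam` to 1.0 because `StringIndexer` by default assigns indices in order of descending label frequency, and ham is the more frequent tag. A minimal sketch of that ordering; the helper name `fit_string_index` is hypothetical and the alphabetical tie-break is an assumption:

```python
from collections import Counter

def fit_string_index(values):
    counts = Counter(values)
    # Most frequent label first; ties are broken alphabetically (an assumption in this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

print(fit_string_index(["ham", "ham", "spam", "ham", "spam"]))  # {'ham': 0.0, 'spam': 1.0}
```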

## Train/test split

In [10]:
rescaled_data.count()

5574

In [11]:
# No seed is passed, so the split (and the sizes printed below) can differ between runs
train, test = rescaled_data.randomSplit([0.7, 0.3])

In [12]:
print(f"train size is {train.count()} and test size is {test.count()}")

train size is 3835 and test size is 1739
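
The train set holds 3835 of 5574 rows, about 68.8% rather than exactly 70%. That is expected when each row is assigned to a split independently at random, which is how the weights are best read: as probabilities, not exact proportions. The sketch below illustrates the idea in plain Python under that assumption; `random_split` is a hypothetical helper, not Spark's implementation:

```python
import random

def random_split(rows, weights, seed):
    # Assign each row independently to a split, with probability proportional to its weight.
    rng = random.Random(seed)
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any draw left over by floating-point rounding.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(5574)), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # sizes are close to, but rarely exactly, 70/30
```

PySpark's `randomSplit` also accepts a `seed` argument, which is worth setting when the reported sizes and scores need to be reproducible.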


## Naive Bayes classifier

In [13]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [14]:
nb = NaiveBayes(smoothing=1.0, modelType='multinomial', featuresCol='features', labelCol='label')

In [15]:
nb_model = nb.fit(train)

In [16]:
test_result = nb_model.transform(test)
test_result.select('label', 'probability', 'prediction').show(truncate=False)


+-----+-------------------------------------------+----------+
|label|probability                                |prediction|
+-----+-------------------------------------------+----------+
|0.0  |[1.0,5.502340741671167E-33]                |0.0       |
|0.0  |[1.0,3.585780293161731E-21]                |0.0       |
|0.0  |[1.0,3.0248056266939986E-84]               |0.0       |
|0.0  |[1.0,1.036257089583559E-62]                |0.0       |
|0.0  |[1.0,2.3702720436641773E-145]              |0.0       |
|0.0  |[1.0,2.0679794336363576E-63]               |0.0       |
|0.0  |[1.0,1.1672419820682577E-84]               |0.0       |
|0.0  |[1.0,1.127062778635335E-37]                |0.0       |
|0.0  |[1.0,2.1757785434878435E-54]               |0.0       |
|0.0  |[1.0,6.448160649666164E-112]               |0.0       |
|0.0  |[1.0,1.4186556060283386E-50]               |0.0       |
|0.0  |[1.0,2.009276465738912E-30]                |0.0       |
|0.0  |[1.0,2.1452869468664015E-22]               |0.0 
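
The probabilities above are extremely close to 0 or 1 because multinomial Naive Bayes multiplies one likelihood per feature occurrence, so small per-term differences compound quickly. The following is a textbook sketch with additive (Laplace) smoothing, not Spark's exact implementation; all names are hypothetical, and smoothing here is applied over the observed feature indices only:

```python
import math
from collections import defaultdict

def train_multinomial_nb(rows, smoothing=1.0):
    # rows: list of (label, {feature_index: value}) pairs.
    label_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for label, features in rows:
        label_counts[label] += 1
        for idx, value in features.items():
            feature_sums[label][idx] += value
            vocab.add(idx)
    n = len(rows)
    log_prior = {c: math.log(k / n) for c, k in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        # Additive smoothing keeps unseen features from producing log(0).
        denom = sum(feature_sums[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            i: math.log((feature_sums[c].get(i, 0.0) + smoothing) / denom) for i in vocab
        }
    return log_prior, log_likelihood

def predict_proba(features, log_prior, log_likelihood):
    # Sum log-probabilities, then normalize with the log-sum-exp trick for stability.
    scores = {
        c: log_prior[c]
        + sum(v * log_likelihood[c][i] for i, v in features.items() if i in log_likelihood[c])
        for c in log_prior
    }
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

rows = [(0.0, {0: 2.0}), (0.0, {0: 1.0, 1: 1.0}), (1.0, {2: 3.0})]
log_prior, log_likelihood = train_multinomial_nb(rows)
print(predict_proba({2: 1.0}, log_prior, log_likelihood))
```

Working in log space matters for real messages: with dozens of terms, the raw product of likelihoods would underflow long before normalization.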

In [17]:
nb_eval = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')

In [19]:
f1 = nb_eval.evaluate(test_result)

print(f'F1 score is computed as {f1} on test set')

F1 score is computed as 0.9819213845067308 on test set
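
With `metricName='f1'`, the evaluator reports a weighted F1: the per-class F1 scores averaged with weights proportional to each class's share of the true labels. Because ham dominates the dataset, this number is driven mostly by ham performance. A minimal pure-Python sketch of that metric; the function name `weighted_f1` is hypothetical:

```python
from collections import Counter

def weighted_f1(labels, predictions):
    total = len(labels)
    support = Counter(labels)  # number of true examples per class
    score = 0.0
    for c, n in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        predicted = sum(1 for p in predictions if p == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += f1 * n / total
    return score

print(weighted_f1([0, 0, 0, 1], [0, 0, 1, 1]))
```

For an imbalanced problem like this one, it can be worth also looking at precision and recall for the spam class alone.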
