# Project

## Spam detection

Design an SMS spam detector using Spark's NLP feature tools and a Naive Bayes classifier.

## Dataset 

UCI Machine Learning Repository, SMS Spam Collection: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

The dataset is a single collection of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.
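For orientation, each record in the raw file is one line holding the label, a tab, and the message text. The snippet below is a minimal plain-Python sketch of that layout. The sample lines are made up for illustration and are not rows from the dataset, and `parse_line` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample lines in the "label<TAB>message" layout (not real dataset rows).
sample_lines = [
    "ham\tSee you at lunch tomorrow",
    "spam\tYou have won a prize, call now",
    "ham\tRunning late, start without me",
]

def parse_line(line):
    """Split one raw line into (label, message) at the first tab."""
    label, message = line.split("\t", 1)
    return label, message

records = [parse_line(line) for line in sample_lines]

# Count how many messages carry each label.
label_counts = Counter(label for label, _ in records)
print(label_counts)
```

Counting labels this way is a quick sanity check on class balance before any features are built.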

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlpProject').getOrCreate()

In [17]:
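# The file is tab-separated and has no header row, so the columns get the default names _c0 and _c1
data = spark.read.csv("smsspamcollection/SMSSpamCollection", inferSchema=True, sep='\t')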

In [18]:
data.show()

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
| ham|I'm gonna be home...|
|spam|SIX chances to wi...|
|spam|URGENT! You have ...|
| ham|I've been searchi...|
| ham|I HAVE A DATE ON ...|
|spam|XXXMobileMovieClu...|
| ham|Oh k...i'm watchi...|
| ham|Eh u remember how...|
| ham|Fine if thats th...|
|spam|England v Macedon...|
+----+--------------------+
only showing top 20 rows



In [21]:
df = data.withColumnRenamed('_c0', 'label').withColumnRenamed('_c1', 'message')
df.show()

+-----+--------------------+
|label|             message|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
|  ham|I'm gonna be home...|
| spam|SIX chances to wi...|
| spam|URGENT! You have ...|
|  ham|I've been searchi...|
|  ham|I HAVE A DATE ON ...|
| spam|XXXMobileMovieClu...|
|  ham|Oh k...i'm watchi...|
|  ham|Eh u remember how...|
|  ham|Fine if thats th...|
| spam|England v Macedon...|
+-----+--------------------+
only showing top 20 rows



In [27]:
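# Tokenize the message column. With gaps=False the pattern matches the tokens
# themselves (runs of word characters) rather than the separators between them.
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(inputCol="message", outputCol="words", pattern="\\w+", gaps=False)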
df_tokenized = regexTokenizer.transform(df)
df_tokenized.show()

+-----+--------------------+--------------------+
|label|             message|               words|
+-----+--------------------+--------------------+
|  ham|Go until jurong p...|[go, until, juron...|
|  ham|Ok lar... Joking ...|[ok, lar, joking,...|
| spam|Free entry in 2 a...|[free, entry, in,...|
|  ham|U dun say so earl...|[u, dun, say, so,...|
|  ham|Nah I don't think...|[nah, i, don, t, ...|
| spam|FreeMsg Hey there...|[freemsg, hey, th...|
|  ham|Even my brother i...|[even, my, brothe...|
|  ham|As per your reque...|[as, per, your, r...|
| spam|WINNER!! As a val...|[winner, as, a, v...|
| spam|Had your mobile 1...|[had, your, mobil...|
|  ham|I'm gonna be home...|[i, m, gonna, be,...|
| spam|SIX chances to wi...|[six, chances, to...|
| spam|URGENT! You have ...|[urgent, you, hav...|
|  ham|I've been searchi...|[i, ve, been, sea...|
|  ham|I HAVE A DATE ON ...|[i, have, a, date...|
| spam|XXXMobileMovieClu...|[xxxmobilemoviecl...|
|  ham|Oh k...i'm watchi...|[oh, k, i, m, wat...|
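To make the tokenizer's behavior concrete, here is a plain-Python approximation of `RegexTokenizer` with `pattern="\\w+"` and `gaps=False`: the text is lowercased (the tokenizer's default) and every match of the pattern becomes a token. The helper name `tokenize` is hypothetical, and the `re.ASCII` flag is an assumption made so that `\w` covers only ASCII word characters. This is a sketch of the idea, not Spark's implementation.

```python
import re

# \w+ matches runs of letters, digits, and underscores.
WORD_PATTERN = re.compile(r"\w+", re.ASCII)

def tokenize(message):
    """Lowercase the message and return every run of word characters."""
    return WORD_PATTERN.findall(message.lower())

print(tokenize("Nah I don't think"))
# ['nah', 'i', 'don', 't', 'think']
```

The apostrophe is not a word character, so "don't" splits into `don` and `t`, which matches the `words` column in the output above.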


In [28]:
# Remove stop words from the tokenized messages
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol='words', outputCol='removed')
df_tokenized_stw = remover.transform(df_tokenized)
df_tokenized_stw.show()

+-----+--------------------+--------------------+--------------------+
|label|             message|               words|             removed|
+-----+--------------------+--------------------+--------------------+
|  ham|Go until jurong p...|[go, until, juron...|[go, jurong, poin...|
|  ham|Ok lar... Joking ...|[ok, lar, joking,...|[ok, lar, joking,...|
| spam|Free entry in 2 a...|[free, entry, in,...|[free, entry, 2, ...|
|  ham|U dun say so earl...|[u, dun, say, so,...|[u, dun, say, ear...|
|  ham|Nah I don't think...|[nah, i, don, t, ...|[nah, think, goes...|
| spam|FreeMsg Hey there...|[freemsg, hey, th...|[freemsg, hey, da...|
|  ham|Even my brother i...|[even, my, brothe...|[even, brother, l...|
|  ham|As per your reque...|[as, per, your, r...|[per, request, me...|
| spam|WINNER!! As a val...|[winner, as, a, v...|[winner, valued, ...|
| spam|Had your mobile 1...|[had, your, mobil...|[mobile, 11, mont...|
|  ham|I'm gonna be home...|[i, m, gonna, be,...|[m, gonna, home, ...|
| spam
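Conceptually, stop-word removal is a filter over each token list. The sketch below uses a tiny hand-picked stop-word set, which is an assumption for illustration only; Spark's default English list is much longer. The name `remove_stop_words` is hypothetical.

```python
# Illustrative subset only; not Spark's default English stop-word list.
ILLUSTRATIVE_STOP_WORDS = {"i", "be", "until", "in", "a", "as", "my", "your"}

def remove_stop_words(tokens, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stop_words(["i", "m", "gonna", "be", "home"]))
# ['m', 'gonna', 'home']
```

Tokens that are not in the list, such as the stray `m` left over from "I'm", pass through unchanged.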

In [29]:
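# Build TF-IDF feature vectors
from pyspark.ml.feature import HashingTF, IDF
# Step 1: term frequencies (TF) via the hashing trick.
# Note: this reads the unfiltered 'words' column; pass inputCol='removed'
# instead to use the stop-word-filtered tokens from the previous step.
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')
featurized_data = hashing_tf.transform(df_tokenized_stw)

# Step 2: inverse document frequencies (IDF), fitted on the whole dataset
idf = IDF(inputCol='rawFeatures', outputCol='features')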
idf_model = idf.fit(featurized_data)
rescaled_data = idf_model.transform(featurized_data)
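The two steps above can be sketched in plain Python to show what the vectors contain. Each token is hashed to an index in a fixed-size vector and counted (TF), then each count is scaled by `log((m + 1) / (df + 1))`, the smoothed IDF formula given in the Spark MLlib documentation, where `m` is the number of documents and `df` the number of documents containing the term. Two assumptions: `zlib.crc32` stands in for Spark's MurmurHash3, so the indices will not match Spark's, and the function names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size that appears in the show() output

def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for tok in tokens:
        # crc32 is a stand-in hash for illustration, not Spark's MurmurHash3.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)

def idf_weights(docs_tf):
    """Compute the smoothed IDF, log((m + 1) / (df + 1)), for every bucket seen."""
    m = len(docs_tf)
    doc_freq = Counter()
    for tf in docs_tf:
        doc_freq.update(tf.keys())  # each document counts once per bucket
    return {idx: math.log((m + 1) / (n + 1)) for idx, n in doc_freq.items()}

def tf_idf(tf, idf):
    """Scale each term count by the IDF weight of its bucket."""
    return {idx: count * idf[idx] for idx, count in tf.items()}

# Two tiny made-up documents.
docs = [["free", "entry", "free"], ["ok", "free"]]
tfs = [hashed_tf(doc) for doc in docs]
idf = idf_weights(tfs)
vectors = [tf_idf(tf, idf) for tf in tfs]
print(vectors[0])
```

A term that occurs in every document, like `free` here, gets an IDF of `log(1) = 0` and so contributes nothing to the final vector, while rarer terms keep a positive weight.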

In [30]:
rescaled_data.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             message|               words|             removed|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|  ham|Go until jurong p...|[go, until, juron...|[go, jurong, poin...|(262144,[38555,52...|(262144,[38555,52...|
|  ham|Ok lar... Joking ...|[ok, lar, joking,...|[ok, lar, joking,...|(262144,[16877,51...|(262144,[16877,51...|
| spam|Free entry in 2 a...|[free, entry, in,...|[free, entry, 2, ...|(262144,[12250,12...|(262144,[12250,12...|
|  ham|U dun say so earl...|[u, dun, say, so,...|[u, dun, say, ear...|(262144,[2306,517...|(262144,[2306,517...|
|  ham|Nah I don't think...|[nah, i, don, t, ...|[nah, think, goes...|(262144,[19036,25...|(262144,[19036,25...|
| spam|FreeMsg Hey there...|[freemsg, hey, th...|[freemsg, hey, da...|(262144,[19036,19...|(2621