<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview

We all hate spam, so a classifier that labels each message as spam or not spam ("ham") would be useful. In this lab the messages are SMS texts.

### Builds on
None

### Run time
20-30 minutes.

### Notes

We will use TF-IDF to vectorize our texts, then NaiveBayes to classify them.

In [2]:
%matplotlib inline

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer


## Step 1: Let's load the text data

We will load the data into a Spark DataFrame. 

In [3]:
t1 = time.perf_counter()

data = spark.read.format("csv").option('header','true').option('delimiter', '\t').\
  option('inferSchema', 'true').load("/data/spam/SMSSpamCollection.txt")

t2 = time.perf_counter() 

print("read {:,} records in {:,.2f} ms".format(data.count(), (t2-t1)*1000))
data.show(5)
# If you want to see full text, do
# data.show(5, False)

read 5,574 records in 9,220.83 ms
+------+--------------------+
|isspam|                text|
+------+--------------------+
|   ham|Go until jurong p...|
|   ham|Ok lar... Joking ...|
|  spam|Free entry in 2 a...|
|   ham|U dun say so earl...|
|   ham|Nah I don't think...|
+------+--------------------+
only showing top 5 rows



In [4]:
## Count spam/ham
data.groupby("isspam").count().show()

+------+-----+
|isspam|count|
+------+-----+
|   ham| 4827|
|  spam|  747|
+------+-----+
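The counts above show that the two classes are imbalanced, which matters when we read accuracy numbers later. As a quick check, the sketch below computes the accuracy of a trivial baseline that labels every message as ham. The variable names are ours; the counts come from the table above.

```python
# Class counts taken from the groupby output above
ham_count = 4827
spam_count = 747
total = ham_count + spam_count

# A classifier that always answers "ham" is right on every ham message
baseline_accuracy = ham_count / total
spam_fraction = spam_count / total

print("total messages      : {:,}".format(total))
print("spam fraction       : {:.3f}".format(spam_fraction))
print("always-ham accuracy : {:.3f}".format(baseline_accuracy))
```

Any model we train should beat this baseline by a clear margin to be worth using.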



## Step 2: Convert each text entry to a vector of words


In [5]:
# Tokenize each sentence into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(data)
# To inspect the schema, call wordsData.printSchema() (note the parentheses)
wordsData.show(5)

# If you want to see all words, do
# wordsData.show(5, False)

+------+--------------------+--------------------+
|isspam|                text|               words|
+------+--------------------+--------------------+
|   ham|Go until jurong p...|[go, until, juron...|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|
|  spam|Free entry in 2 a...|[free, entry, in,...|
|   ham|U dun say so earl...|[u, dun, say, so,...|
|   ham|Nah I don't think...|[nah, i, don't, t...|
+------+--------------------+--------------------+
only showing top 5 rows
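As far as we understand it, Spark's `Tokenizer` simply lowercases the text and splits it on whitespace, which is why punctuation stays attached to words such as `lar...` in the output above. The plain-Python sketch below imitates that behavior; `simple_tokenize` is a name we made up for illustration, and it is not guaranteed to match Spark in every edge case.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split on each single whitespace character."""
    return re.split(r"\s", text.lower())

tokens = simple_tokenize("WINNER!  Click here")
print(tokens)
# ['winner!', '', 'click', 'here']
```

Note the empty token produced by the double space, and that `winner!` keeps its exclamation mark. A regex-based tokenizer would be one way to clean this up.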



## Step 3: Compute TF

In [6]:
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show(5)

# If you want to see word positions and TF frequencies, do
# featurizedData.show(5, False)

+------+--------------------+--------------------+--------------------+
|isspam|                text|               words|         rawFeatures|
+------+--------------------+--------------------+--------------------+
|   ham|Go until jurong p...|[go, until, juron...|(1000,[7,77,150,1...|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|(1000,[20,316,484...|
|  spam|Free entry in 2 a...|[free, entry, in,...|(1000,[30,35,73,1...|
|   ham|U dun say so earl...|[u, dun, say, so,...|(1000,[57,368,372...|
|   ham|Nah I don't think...|[nah, i, don't, t...|(1000,[135,163,32...|
+------+--------------------+--------------------+--------------------+
only showing top 5 rows
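The `rawFeatures` column is a sparse vector of length 1000: each word is hashed to an index, and the value at that index counts how often words landing there occur in the message. Here is a minimal sketch of that "hashing trick" in plain Python. It uses `zlib.crc32` as a stand-in hash (our assumption), so the indices will not match the ones Spark prints.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=1000):
    """Map tokens to bucket counts: index = hash(token) mod num_features."""
    counts = Counter()
    for token in tokens:
        # zlib.crc32 is a stand-in hash chosen for this sketch (an assumption);
        # Spark uses a different hash, so its indices will differ
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(sorted(counts.items()))

tokens = ["free", "entry", "free", "phones"]
vector = hashing_tf(tokens)
print(vector)
```

Because the index is taken modulo `num_features`, two different words can land in the same bucket. That is the price of not having to build a vocabulary first.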



## Step 4: Multiply TF by IDF

In [7]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("isspam", "features").show(5)

+------+--------------------+
|isspam|            features|
+------+--------------------+
|   ham|(1000,[7,77,150,1...|
|   ham|(1000,[20,316,484...|
|  spam|(1000,[30,35,73,1...|
|   ham|(1000,[57,368,372...|
|   ham|(1000,[135,163,32...|
+------+--------------------+
only showing top 5 rows
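IDF down-weights terms that appear in many messages. To our knowledge Spark computes it as `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The sketch below applies that formula to a tiny made-up corpus. It works on words directly, whereas Spark works on the hashed indices from Step 3.

```python
import math

def idf_weights(documents):
    """Return {term: log((m + 1) / (df + 1))} for a list of token lists."""
    m = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

# Toy corpus, invented for this example
docs = [["free", "entry"], ["free", "phones"], ["ok", "see", "you"]]
weights = idf_weights(docs)

# TF-IDF for the first document: raw count times the term's IDF weight
tfidf_doc0 = {term: docs[0].count(term) * weights[term] for term in set(docs[0])}
print(tfidf_doc0)
```

Here `free` appears in two of three documents and gets a lower weight than `entry`, which appears in only one.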



## Step 4 (continued): Use StringIndexer to create a numeric label from the string column "isspam"

In [8]:

indexer = StringIndexer(inputCol="isspam", outputCol="label")
indexed = indexer.fit(rescaledData).transform(rescaledData)
indexed.show()


+------+--------------------+--------------------+--------------------+--------------------+-----+
|isspam|                text|               words|         rawFeatures|            features|label|
+------+--------------------+--------------------+--------------------+--------------------+-----+
|   ham|Go until jurong p...|[go, until, juron...|(1000,[7,77,150,1...|(1000,[7,77,150,1...|  0.0|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|(1000,[20,316,484...|(1000,[20,316,484...|  0.0|
|  spam|Free entry in 2 a...|[free, entry, in,...|(1000,[30,35,73,1...|(1000,[30,35,73,1...|  1.0|
|   ham|U dun say so earl...|[u, dun, say, so,...|(1000,[57,368,372...|(1000,[57,368,372...|  0.0|
|   ham|Nah I don't think...|[nah, i, don't, t...|(1000,[135,163,32...|(1000,[135,163,32...|  0.0|
|  spam|FreeMsg Hey there...|[freemsg, hey, th...|(1000,[25,36,68,9...|(1000,[25,36,68,9...|  1.0|
|   ham|Even my brother i...|[even, my, brothe...|(1000,[18,47,48,5...|(1000,[18,47,48,5...|  0.0|
|   ham|As
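By default `StringIndexer` orders labels by descending frequency, which is why `ham` (the more common class) becomes `0.0` and `spam` becomes `1.0`. The sketch below reproduces that ordering rule in plain Python; `fit_string_indexer` is a hypothetical helper, not a Spark function.

```python
from collections import Counter

def fit_string_indexer(values):
    """Assign 0.0 to the most frequent value, 1.0 to the next, and so on."""
    ordered = [value for value, _ in Counter(values).most_common()]
    return {value: float(i) for i, value in enumerate(ordered)}

# Small made-up sample with more ham than spam, like the real data
labels = ["ham", "ham", "spam", "ham", "ham", "spam", "ham"]
mapping = fit_string_indexer(labels)
print(mapping)
# {'ham': 0.0, 'spam': 1.0}
```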

## Step 5: Split into Training and Test

We will split our dataset into training and test sets.

In [9]:
# Split the data into train and test
(train, test) = indexed.randomSplit([0.8, 0.2], seed=1234)

print("training set count : ", train.count())
print("testing set count : ", test.count())

training set count :  4465
testing set count :  1109


## Step 6: Fit Naive Bayes Model

In [10]:

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
t1 = time.perf_counter()
model = nb.fit(train)
t2 = time.perf_counter()

print("trained on {:,} records  in {:,.2f} ms".\
      format(train.count(), (t2-t1)*1000))

# Accuracy and confusion matrix on the training data
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
train_fit = model.transform(train)
accuracy = evaluator.evaluate(train_fit)
print("Train set accuracy = " + str(accuracy))
train_fit.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()

trained on 4,465 records  in 1,293.90 ms
Train set accuracy = 0.9475923852183651
+-----+----+---+
|label|   0|  1|
+-----+----+---+
|  0.0|3678|199|
|  1.0|  35|553|
+-----+----+---+
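To make the model less of a black box, here is a minimal multinomial Naive Bayes in plain Python, with additive smoothing playing the role of `smoothing=1.0` above. It is a sketch under simplifying assumptions: it uses raw word counts on a toy corpus we invented, whereas the Spark model is trained on TF-IDF vectors. All function names are ours.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    """Fit class priors and smoothed per-class token log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + smoothing * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {t: math.log((token_counts[label][t] + smoothing) / denom)
                        for t in vocab},
        }
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; skip unseen tokens."""
    def score(params):
        return params["log_prior"] + sum(
            params["log_lik"][t] for t in doc if t in params["log_lik"])
    return max(model, key=lambda label: score(model[label]))

# Toy training set, invented for this example
docs = [["free", "prize", "call"], ["free", "entry"],
        ["see", "you", "later"], ["ok", "call", "me", "later"]]
labels = ["spam", "spam", "ham", "ham"]
toy_model = train_nb(docs, labels)

print(predict_nb(toy_model, ["free", "prize"]))   # spam
print(predict_nb(toy_model, ["see", "you"]))      # ham
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero.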



## Step 7: Evaluate the Model

Let's see how the model performs on the held-out test set, using accuracy as the metric.

In [11]:
test_predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(test_predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.9323715058611362


Let us create a confusion matrix.

In [12]:
test_predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()

## Can you explain the confusion matrix?
# Total number of test records, for comparison with the matrix entries
print(test_predictions.count())

+-----+---+---+
|label|  0|  1|
+-----+---+---+
|  0.0|886| 64|
|  1.0| 11|148|
+-----+---+---+

1109
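Accuracy alone hides how the errors are distributed. The worked arithmetic below derives accuracy, precision, and recall for the spam class from the four counts in the matrix above (rows are true labels, columns are predictions). The variable names are ours.

```python
# Counts read from the test-set confusion matrix above
# (rows are true labels, columns are predictions; 0 = ham, 1 = spam)
tn, fp = 886, 64     # true ham:  predicted ham, predicted spam
fn, tp = 11, 148     # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how much was caught

print("total     = {}".format(total))          # 1109
print("accuracy  = {:.4f}".format(accuracy))   # 0.9324
print("precision = {:.4f}".format(precision))  # 0.6981
print("recall    = {:.4f}".format(recall))     # 0.9308
```

The model catches most spam, but roughly three in ten messages it flags are actually ham, which is the costlier mistake for a spam filter.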


## Step 8: Improve prediction results

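With only 1,000 hash buckets, many different words share the same feature index, which limits accuracy: on the test set above, 64 ham messages were flagged as spam. Increase `numFeatures` for HashingTF (for example, to 10,000), rerun Steps 3 through 7, and compare the results.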
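To get a feel for why the number of features matters, the sketch below estimates how crowded the hash buckets are for different sizes, assuming tokens are spread uniformly at random. The vocabulary size is a made-up figure for illustration; we have not measured it on this dataset.

```python
def expected_occupied_buckets(vocab_size, num_features):
    """Expected number of non-empty buckets when vocab_size distinct tokens
    are hashed uniformly at random into num_features buckets."""
    return num_features * (1.0 - (1.0 - 1.0 / num_features) ** vocab_size)

# vocab_size is a hypothetical figure for illustration, not measured from the data
vocab_size = 10000

for num_features in (1000, 10000, 100000):
    occupied = expected_occupied_buckets(vocab_size, num_features)
    tokens_per_bucket = vocab_size / occupied
    print("numFeatures={:>7,}: ~{:,.0f} buckets used, ~{:.1f} tokens per used bucket"
          .format(num_features, occupied, tokens_per_bucket))
```

When several tokens share a bucket, the model cannot tell a spammy word from a harmless one that happens to collide with it.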

## Step 9:  Your own test

Now it's your turn!   Make a new dataframe with some sample test data of your own creation.  Make some "spammy" SMSes and some ordinary ones.  See how our spam filter does.

In [13]:
# TODO: make a dataframe with some of your own data.
mydata = pd.DataFrame({'text' : ['hey, can we meet 1 hr later?', 
                                'WINNER!  Click here to claim your FREE car !!!!',
                                'CHEAP DEGREEES !!', 
                                'your text here',
                                'FREE phones']
                         })

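mydata2 = spark.createDataFrame(mydata)

# Apply the same tokenizer settings used for the training data
tokenizer = Tokenizer(inputCol="text", outputCol="words")
fv = tokenizer.transform(mydata2)
fv.show()

# numFeatures must match the value the model was trained with
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
fv = hashingTF.transform(fv)

# Fit IDF on the original corpus (featurizedData), not on these five messages,
# so the new texts are weighted the same way as the training data
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
fv = idfModel.transform(fv)
fv.show()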

+--------------------+--------------------+
|                text|               words|
+--------------------+--------------------+
|hey, can we meet ...|[hey,, can, we, m...|
|WINNER!  Click he...|[winner!, , click...|
|   CHEAP DEGREEES !!|[cheap, degreees,...|
|      your text here|  [your, text, here]|
|         FREE phones|      [free, phones]|
+--------------------+--------------------+

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|         rawFeatures|            features|
+--------------------+--------------------+--------------------+--------------------+
|hey, can we meet ...|[hey,, can, we, m...|(1000,[238,486,74...|(1000,[238,486,74...|
|WINNER!  Click he...|[winner!, , click...|(1000,[73,135,263...|(1000,[73,135,263...|
|   CHEAP DEGREEES !!|[cheap, degreees,...|(1000,[119,339,66...|(1000,[119,339,66...|
|      your text here|  [your, text, here]|(1000,[135,169,26...|(1000,[135,169,26...|
|

In [14]:
predictions = model.transform(fv)
predictions.select(['text', 'prediction']).show()

+--------------------+----------+
|                text|prediction|
+--------------------+----------+
|hey, can we meet ...|       0.0|
|WINNER!  Click he...|       0.0|
|   CHEAP DEGREEES !!|       0.0|
|      your text here|       1.0|
|         FREE phones|       1.0|
+--------------------+----------+



## FUN : How will you defeat this algorithm? :-) 

If you were a spammer, how could you defeat this algorithm?

Some approaches:
- Find alternate words for spammy words (e.g. FREE --> no cost)
- Misspell words: Winner --> w1nner, FREE --> FR33
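The sketch below illustrates why the misspelling trick works against a word-based model: the obfuscated words are simply different tokens, so the features the model associates with spam never show up. The tokenizer here is our plain-Python approximation of the one used in Step 2, and the example messages are invented.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split on each single whitespace character."""
    return re.split(r"\s", text.lower())

original = set(simple_tokenize("FREE entry for the WINNER"))
obfuscated = set(simple_tokenize("FR33 entry for the W1NNER"))

# Tokens the two messages still share
shared = original & obfuscated
# Spam-indicating tokens that the obfuscated message no longer contains
lost = original - obfuscated

print(sorted(shared))   # ['entry', 'for', 'the']
print(sorted(lost))     # ['free', 'winner']
```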

<img src="../assets/images/come-tothe-dark-side-iin-we-have-cookies.png">

