<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview

Nobody likes spam, so a classifier that labels messages as spam or not spam ("ham") is useful. In this lab we build one for SMS messages using Naive Bayes.

### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

PySpark has a class called NaiveBayes that can be used to do Naive Bayes classification.

In [29]:
%matplotlib inline

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print('Spark UI running on http://18.208.221.237:' + sc.uiWebUrl.split(':')[2])

Spark UI running on http://18.208.221.237:4040


## Step 1: Let's load the dataframe

We will load the data into a Spark dataframe. The outcome column, isspam, holds either "ham" or "spam"; we will convert it into a numeric label column in Step 3.

In [30]:
t1 = time.perf_counter()

dataset = spark.read.format("csv").\
          option('header','true').\
          option('delimiter', '\t').\
          option('inferSchema', 'true').\
          load("/data/spam/SMSSpamCollection.txt")

t2 = time.perf_counter() 

print("read {:,} records in {:,.2f} ms".format(dataset.count(), (t2-t1)*1000))

dataset.printSchema()
dataset.show()

read 5,574 records in 128.62 ms
root
 |-- isspam: string (nullable = true)
 |-- text: string (nullable = true)

+------+--------------------+
|isspam|                text|
+------+--------------------+
|   ham|Go until jurong p...|
|   ham|Ok lar... Joking ...|
|  spam|Free entry in 2 a...|
|   ham|U dun say so earl...|
|   ham|Nah I don't think...|
|  spam|FreeMsg Hey there...|
|   ham|Even my brother i...|
|   ham|As per your reque...|
|  spam|WINNER!! As a val...|
|  spam|Had your mobile 1...|
|   ham|I'm gonna be home...|
|  spam|SIX chances to wi...|
|  spam|URGENT! You have ...|
|   ham|I've been searchi...|
|   ham|I HAVE A DATE ON ...|
|  spam|XXXMobileMovieClu...|
|   ham|Oh k...i'm watchi...|
|   ham|Eh u remember how...|
|   ham|Fine if thats th...|
|  spam|England v Macedon...|
+------+--------------------+
only showing top 20 rows



In [31]:
## Count spam/ham
dataset.groupby("isspam").count().show()

+------+-----+
|isspam|count|
+------+-----+
|   ham| 4827|
|  spam|  747|
+------+-----+
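
The classes are imbalanced, which matters when reading any accuracy figure. A minimal sketch, using only the two counts printed above, of the accuracy a trivial classifier would get by always answering "ham":

```python
# Class counts copied from the groupby output above.
counts = {"ham": 4827, "spam": 747}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print("total messages    :", total)
print("majority class    :", majority_class)
print("baseline accuracy : {:.4f}".format(baseline_accuracy))
```

Any model worth keeping should beat this baseline by a clear margin.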



## Step 2: Vectorize using tf/idf

Let's start with TF/IDF vectorization. TF/IDF counts how often each term occurs in a message (term frequency) and scales that count by the inverse document frequency, which down-weights terms that appear in many messages across the dataset.
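
To make the weighting concrete, here is a minimal pure-Python sketch on three made-up messages. It assumes the smoothed formula documented for Spark's IDF, idf = log((N + 1) / (df + 1)); the toy corpus and the helper names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already tokenized.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF as documented for Spark ML: log((N + 1) / (df + 1)).
    return math.log((n_docs + 1) / (doc_freq[term] + 1))

def tfidf(doc):
    # Term frequency is the raw count of the term in the document.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

A term that occurs in fewer documents ("entry") gets a larger IDF than one that occurs in more of them ("free"), so rare terms carry more weight per occurrence.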

This leads to very high-dimensional data, because every distinct word in the dataset would become its own dimension. HashingTF caps the dimensionality by hashing words into a fixed number of buckets (numFeatures).
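
HashingTF keeps the vector size fixed by hashing each word to one of `numFeatures` buckets instead of giving every distinct word its own column. A rough pure-Python sketch of that idea follows; `zlib.crc32` is only a stand-in for the MurmurHash3 function Spark uses, so the bucket indices will not match Spark's, and `hashing_tf` is a hypothetical helper.

```python
import zlib
from collections import Counter

def hashing_tf(words, num_features):
    """Map a list of words to a sparse {bucket_index: count} vector."""
    vector = Counter()
    for word in words:
        # crc32 is a deterministic stand-in hash (Spark uses MurmurHash3).
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        vector[bucket] += 1
    return dict(vector)

words = ["free", "entry", "in", "a", "free", "prize", "draw"]
sparse = hashing_tf(words, num_features=1000)

# Every index is below num_features, however large the vocabulary is.
print(sorted(sparse.items()))
```

The sparse form mirrors what Spark prints in the rawFeatures column: a vector size followed by the non-zero indices and their values.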

In [32]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

## TODO : split the text into words
## Hint : outputCol = 'words'
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(dataset)
wordsData.show()


+------+--------------------+--------------------+
|isspam|                text|               words|
+------+--------------------+--------------------+
|   ham|Go until jurong p...|[go, until, juron...|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|
|  spam|Free entry in 2 a...|[free, entry, in,...|
|   ham|U dun say so earl...|[u, dun, say, so,...|
|   ham|Nah I don't think...|[nah, i, don't, t...|
|  spam|FreeMsg Hey there...|[freemsg, hey, th...|
|   ham|Even my brother i...|[even, my, brothe...|
|   ham|As per your reque...|[as, per, your, r...|
|  spam|WINNER!! As a val...|[winner!!, as, a,...|
|  spam|Had your mobile 1...|[had, your, mobil...|
|   ham|I'm gonna be home...|[i'm, gonna, be, ...|
|  spam|SIX chances to wi...|[six, chances, to...|
|  spam|URGENT! You have ...|[urgent!, you, ha...|
|   ham|I've been searchi...|[i've, been, sear...|
|   ham|I HAVE A DATE ON ...|[i, have, a, date...|
|  spam|XXXMobileMovieClu...|[xxxmobilemoviecl...|
|   ham|Oh k...i'm watchi...|[o

In [43]:
## compute the hash of words
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.show()


+------+--------------------+--------------------+--------------------+--------------------+
|isspam|                text|               words|         rawFeatures|            features|
+------+--------------------+--------------------+--------------------+--------------------+
|   ham|Go until jurong p...|[go, until, juron...|(1000,[7,77,150,1...|(1000,[7,77,150,1...|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|(1000,[20,316,484...|(1000,[20,316,484...|
|  spam|Free entry in 2 a...|[free, entry, in,...|(1000,[30,35,73,1...|(1000,[30,35,73,1...|
|   ham|U dun say so earl...|[u, dun, say, so,...|(1000,[57,368,372...|(1000,[57,368,372...|
|   ham|Nah I don't think...|[nah, i, don't, t...|(1000,[135,163,32...|(1000,[135,163,32...|
|  spam|FreeMsg Hey there...|[freemsg, hey, th...|(1000,[25,36,68,9...|(1000,[25,36,68,9...|
|   ham|Even my brother i...|[even, my, brothe...|(1000,[18,47,48,5...|(1000,[18,47,48,5...|
|   ham|As per your reque...|[as, per, your, r...|(1000,[36,71,92,2...

In [44]:
rescaledData.select("isspam", "text", "features").show()

+------+--------------------+--------------------+
|isspam|                text|            features|
+------+--------------------+--------------------+
|   ham|Go until jurong p...|(1000,[7,77,150,1...|
|   ham|Ok lar... Joking ...|(1000,[20,316,484...|
|  spam|Free entry in 2 a...|(1000,[30,35,73,1...|
|   ham|U dun say so earl...|(1000,[57,368,372...|
|   ham|Nah I don't think...|(1000,[135,163,32...|
|  spam|FreeMsg Hey there...|(1000,[25,36,68,9...|
|   ham|Even my brother i...|(1000,[18,47,48,5...|
|   ham|As per your reque...|(1000,[36,71,92,2...|
|  spam|WINNER!! As a val...|(1000,[39,43,61,7...|
|  spam|Had your mobile 1...|(1000,[36,73,82,1...|
|   ham|I'm gonna be home...|(1000,[26,41,106,...|
|  spam|SIX chances to wi...|(1000,[15,35,36,4...|
|  spam|URGENT! You have ...|(1000,[68,73,122,...|
|   ham|I've been searchi...|(1000,[19,36,39,1...|
|   ham|I HAVE A DATE ON ...|(1000,[44,82,170,...|
|  spam|XXXMobileMovieClu...|(1000,[41,43,49,6...|
|   ham|Oh k...i'm watchi...|(1

## Step 3: Create a numeric label out of the string column "isspam."
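
By default, StringIndexer assigns indices in order of descending frequency, so the most common value gets 0.0. A small pure-Python sketch of that rule, using the class counts from Step 1 (`index_labels` is a hypothetical helper, not a Spark function):

```python
from collections import Counter

def index_labels(values):
    """Return {label: index}, most frequent label first (index 0.0)."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Same class balance as the dataset: 4827 ham, 747 spam.
values = ["ham"] * 4827 + ["spam"] * 747
mapping = index_labels(values)
print(mapping)  # {'ham': 0.0, 'spam': 1.0}
```

This is why "ham" maps to 0.0 and "spam" to 1.0 in this lab.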

In [45]:
from pyspark.ml.feature import StringIndexer

## TODO : Index 'isspam' column into 'label' column
## Hint : inputCol = 'isspam',   outputCol = 'label'
indexer = StringIndexer(inputCol="isspam", outputCol="label")
indexed = indexer.fit(rescaledData).transform(rescaledData)

indexed.select(['text', 'isspam', 'label', 'features']).show()


+--------------------+------+-----+--------------------+
|                text|isspam|label|            features|
+--------------------+------+-----+--------------------+
|Go until jurong p...|   ham|  0.0|(1000,[7,77,150,1...|
|Ok lar... Joking ...|   ham|  0.0|(1000,[20,316,484...|
|Free entry in 2 a...|  spam|  1.0|(1000,[30,35,73,1...|
|U dun say so earl...|   ham|  0.0|(1000,[57,368,372...|
|Nah I don't think...|   ham|  0.0|(1000,[135,163,32...|
|FreeMsg Hey there...|  spam|  1.0|(1000,[25,36,68,9...|
|Even my brother i...|   ham|  0.0|(1000,[18,47,48,5...|
|As per your reque...|   ham|  0.0|(1000,[36,71,92,2...|
|WINNER!! As a val...|  spam|  1.0|(1000,[39,43,61,7...|
|Had your mobile 1...|  spam|  1.0|(1000,[36,73,82,1...|
|I'm gonna be home...|   ham|  0.0|(1000,[26,41,106,...|
|SIX chances to wi...|  spam|  1.0|(1000,[15,35,36,4...|
|URGENT! You have ...|  spam|  1.0|(1000,[68,73,122,...|
|I've been searchi...|   ham|  0.0|(1000,[19,36,39,1...|
|I HAVE A DATE ON ...|   ham|  

## Step 4: Split into training and test

We will split our dataset into training and test sets.

In [46]:
# TODO : Split the data into training and test sets (80/20)
(train, test) = indexed.randomSplit([80.0, 20.0])

print("training set count : ", train.count())
print("testing set count : ", test.count())

training set count :  4489
testing set count :  1085
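
randomSplit normalizes its weights, so `[80.0, 20.0]` behaves like `[0.8, 0.2]`, and it assigns each row at random, which is why the counts above are close to but not exactly 80/20. The sketch below mimics that behaviour in plain Python; the `seed` value and the `random_split` helper are illustrative assumptions, and this is not Spark's actual sampling code.

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row independently to a split, with normalized weights."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [80, 20] -> [0.8, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(range(5574), [80.0, 20.0])
print(len(train_rows), len(test_rows))
```

In Spark, passing a `seed` argument to randomSplit makes the split repeatable across runs.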


## Step 5: Fit Naive Bayes model

In [47]:
from pyspark.ml.classification import NaiveBayes

## TODO : create the trainer and set its parameters
## Hint : NaiveBayes  (see the class name above)
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
t1 = time.perf_counter()
## TODO : fit on training data (hint: train)
model = nb.fit(train)
t2 = time.perf_counter()

print("trained on {:,} records  in {:,.2f} ms".\
      format(train.count(), (t2-t1)*1000))

trained on 4,489 records  in 983.07 ms
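
For intuition about what fit estimates, here is a minimal pure-Python multinomial Naive Bayes with additive (Laplace) smoothing, the role played by `smoothing=1.0` above. It is a sketch on a made-up four-message corpus, not Spark's implementation; all function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + smoothing * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {w: math.log((word_counts[label][w] + smoothing) / denom)
                        for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; skip unseen words."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][w] for w in doc if w in params["log_lik"])
    return max(scores, key=scores.get)

docs = [["win", "free", "prize"], ["free", "entry", "now"],
        ["see", "you", "later"], ["call", "you", "later"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["free", "prize", "now"]))   # spam
print(predict_nb(model, ["see", "you", "now"]))      # ham
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero.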


## Step 6: Run test data

Let's call .transform on our model to make predictions on the test data. The output will be in the "prediction" column, and the correct answer is in the "label" column.

We can then evaluate the model by comparing the two columns.

In [48]:
# select example rows to display.
## TODO : transform on test data (hint : test)
predictions = model.transform(test)
predictions.show()


+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|isspam|                text|               words|         rawFeatures|            features|label|       rawPrediction|         probability|prediction|
+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|   ham| said kiss, kiss,...|[, said, kiss,, k...|(1000,[44,51,133,...|(1000,[44,51,133,...|  0.0|[-638.12813753362...|[1.0,2.2669448674...|       0.0|
|   ham| says that he's q...|[, says, that, he...|(1000,[38,122,138...|(1000,[38,122,138...|  0.0|[-887.09354460589...|[1.0,1.2982581398...|       0.0|
|   ham|"SYMPTOMS" when U...|["symptoms", when...|(1000,[15,76,138,...|(1000,[15,76,138,...|  0.0|[-604.72010787181...|[1.0,3.6560762633...|       0.0|
|   ham|&lt;#&gt; , that'...|[&lt;#&gt;, ,, th...|(1000,[88,597,760...|(1000,[88,597,760

## Step 7: Evaluate the model

Let's look at how our model performs. We will start by measuring accuracy, the fraction of test messages classified correctly.

In [49]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.9345622119815669


Now let's build a confusion matrix. Rows are the true labels and columns are the predictions (0 = ham, 1 = spam).

In [50]:
predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()

## Can you explain the confusion matrix?

+-----+---+---+
|label|  0|  1|
+-----+---+---+
|  0.0|883| 55|
|  1.0| 16|131|
+-----+---+---+
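
The four counts above are enough to derive the usual metrics. A short sketch, treating spam (label 1) as the positive class:

```python
# Counts copied from the confusion matrix above (rows = label, cols = prediction).
tn, fp = 883, 55    # true ham:  predicted ham / predicted spam
fn, tp = 16, 131    # true spam: predicted ham / predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, share that is spam
recall = tp / (tp + fn)      # of actual spam, share that was caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy  : {:.4f}".format(accuracy))
print("precision : {:.4f}".format(precision))
print("recall    : {:.4f}".format(recall))
print("f1        : {:.4f}".format(f1))
```

The accuracy matches the evaluator's result, while precision shows that 55 legitimate messages were flagged as spam, which for a spam filter is often the more costly kind of mistake.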



## Step 8: Improve prediction results

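With only 1,000 hash buckets, many different words collide on the same feature, which limits accuracy. For comparison, always predicting "ham" would already score about 87% on this dataset (4,827 of 5,574 messages). Increase numFeatures for HashingTF in Step 2, rerun the steps, and compare the results.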
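
Why more features can help: with few buckets, unrelated words hash to the same index and the model cannot tell them apart. A small sketch counting such collisions for a toy vocabulary (it uses `zlib.crc32` as a stand-in for Spark's hash function, so the exact numbers are illustrative only, and `count_collisions` is a hypothetical helper):

```python
import zlib

def count_collisions(vocabulary, num_features):
    """Number of distinct words that share a bucket with another word."""
    distinct = set(vocabulary)
    buckets = {zlib.crc32(w.encode("utf-8")) % num_features for w in distinct}
    return len(distinct) - len(buckets)

vocabulary = ["free", "entry", "winner", "prize", "urgent", "mobile",
              "call", "later", "home", "brother", "date", "movie"]

for num_features in (4, 16, 1000, 2 ** 20):
    print(num_features, count_collisions(vocabulary, num_features))
```

Spark's documentation suggests using a power of two for numFeatures so that words spread evenly over the buckets.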

## Step 9:  Run your own test

Now it's your turn!   Make a new dataframe with some sample test data of your own creation.  Make some "spammy" SMSes and some ordinary ones.  See how our spam filter does.

In [51]:
# TODO: make a dataframe with some of your own data.
mydata = pd.DataFrame({'text' : ['hey, can we meet 1 hr later?', 
                                'WINNER!  Click here to claim your prize !!!!',
                                'CHEAP DEGREEES !!', 
                                'your text here',
                                'test']
                         })

mydata2 = spark.createDataFrame(mydata)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
fv = tokenizer.transform(mydata2)
fv.show()

## NOTE : make sure this 'numFeatures' matches the 'numFeatures' in step-2
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
fv = hashingTF.transform(fv)

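## Fit IDF on the training corpus (featurizedData from step-2), not on the new messages,
## so the weights match the ones the model was trained with
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)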
fv = idfModel.transform(fv)
fv.show()

+--------------------+--------------------+
|                text|               words|
+--------------------+--------------------+
|hey, can we meet ...|[hey,, can, we, m...|
|WINNER!  Click he...|[winner!, , click...|
|   CHEAP DEGREEES !!|[cheap, degreees,...|
|      your text here|  [your, text, here]|
|                test|              [test]|
+--------------------+--------------------+

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|         rawFeatures|            features|
+--------------------+--------------------+--------------------+--------------------+
|hey, can we meet ...|[hey,, can, we, m...|(1000,[238,486,74...|(1000,[238,486,74...|
|WINNER!  Click he...|[winner!, , click...|(1000,[135,189,26...|(1000,[135,189,26...|
|   CHEAP DEGREEES !!|[cheap, degreees,...|(1000,[119,339,66...|(1000,[119,339,66...|
|      your text here|  [your, text, here]|(1000,[135,169,26...|(1000,[135,169,26...|
|

In [52]:
predictions = model.transform(fv)
predictions.select(['text', 'prediction']).show()

+--------------------+----------+
|                text|prediction|
+--------------------+----------+
|hey, can we meet ...|       0.0|
|WINNER!  Click he...|       0.0|
|   CHEAP DEGREEES !!|       0.0|
|      your text here|       1.0|
|                test|       0.0|
+--------------------+----------+



## FUN : How will you defeat this algorithm? :-) 

If you were a spammer, how would you get your messages past this filter?

<img src="../assets/images/come-tothe-dark-side-iin-we-have-cookies.png">

# BONUS: Word2Vec Instead of TF/IDF

We used the TF/IDF encoding. We might get better results if we use Word2Vec instead. Run with Word2Vec and see if you get a better accuracy rate.