# NLP Code Along questions

For this code along we will build a spam filter!

We'll use a classic dataset for this, the SMS Spam Collection from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

#### Load and read the dataset, and have Spark infer the data types

In [1]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as funct
spark = SparkSession.builder.appName('Spam_filter').getOrCreate()

In [2]:
input_text = spark.read.csv("SMSSpamCollection", sep = "\t", inferSchema=True, header = False)
input_text.show()

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
| ham|I'm gonna be home...|
|spam|SIX chances to wi...|
|spam|URGENT! You have ...|
| ham|I've been searchi...|
| ham|I HAVE A DATE ON ...|
|spam|XXXMobileMovieClu...|
| ham|Oh k...i'm watchi...|
| ham|Eh u remember how...|
| ham|Fine if thats th...|
|spam|England v Macedon...|
+----+--------------------+
only showing top 20 rows



In [3]:
input_text = input_text.withColumnRenamed('_c0', 'class').withColumnRenamed('_c1', 'text')
input_text.show()

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
|  ham|I'm gonna be home...|
| spam|SIX chances to wi...|
| spam|URGENT! You have ...|
|  ham|I've been searchi...|
|  ham|I HAVE A DATE ON ...|
| spam|XXXMobileMovieClu...|
|  ham|Oh k...i'm watchi...|
|  ham|Eh u remember how...|
|  ham|Fine if thats th...|
| spam|England v Macedon...|
+-----+--------------------+
only showing top 20 rows



## Clean and Prepare the Data

#### Create a new length feature

In [4]:
input_text = input_text.withColumn("length", funct.length("text"))
input_text.show()

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
|  ham|U dun say so earl...|    49|
|  ham|Nah I don't think...|    61|
| spam|FreeMsg Hey there...|   147|
|  ham|Even my brother i...|    77|
|  ham|As per your reque...|   160|
| spam|WINNER!! As a val...|   157|
| spam|Had your mobile 1...|   154|
|  ham|I'm gonna be home...|   109|
| spam|SIX chances to wi...|   136|
| spam|URGENT! You have ...|   155|
|  ham|I've been searchi...|   196|
|  ham|I HAVE A DATE ON ...|    35|
| spam|XXXMobileMovieClu...|   149|
|  ham|Oh k...i'm watchi...|    26|
|  ham|Eh u remember how...|    81|
|  ham|Fine if thats th...|    56|
| spam|England v Macedon...|   155|
+-----+--------------------+------+
only showing top 20 rows



#### Print the mean message length grouped by class

In [5]:
input_text.groupBy("class").agg(funct.mean('length')).show()

+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+



## Feature Transformations

In this part you transform your raw text into a TF-IDF representation:

- chain the transformers Tokenizer, StopWordsRemover, CountVectorizer and IDF on the text column so that the final output column is named 'tf_idf'
- use the StringIndexer transformer to turn the class column into an output column named 'label'
- use VectorAssembler with 'tf_idf' and 'length' as input columns to create an output column named 'features'
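To make the 'tf_idf' column less of a black box, here is a minimal pure-Python sketch of what CountVectorizer followed by IDF computes. It assumes the smoothed formula from the Spark ML documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The three-message corpus and all variable names are made up for illustration, and breaking vocabulary ties alphabetically is a simplification.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already tokenized and lower-cased.
docs = [
    ["free", "entry", "win", "free"],
    ["see", "you", "tonight"],
    ["win", "cash", "tonight"],
]

# Vocabulary ordered by total corpus count, most frequent first
# (ties are broken alphabetically in this sketch).
totals = Counter(tok for doc in docs for tok in doc)
vocab = sorted(totals, key=lambda t: (-totals[t], t))
index = {term: i for i, term in enumerate(vocab)}

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(tok for doc in docs for tok in set(doc))
m = len(docs)
idf = {t: math.log((m + 1) / (doc_freq[t] + 1)) for t in vocab}


def tf_idf(doc):
    """Return a sparse {index: weight} mapping for one tokenized message."""
    counts = Counter(doc)
    return {index[t]: counts[t] * idf[t] for t in sorted(counts, key=index.get)}


for doc in docs:
    # Printed as (vector size, {index: weight}), loosely mirroring the
    # (size, [indices], [values]) sparse vectors that Spark shows.
    print(len(vocab), tf_idf(doc))
```

A term that appears in many messages gets a small IDF weight, so common words contribute little to the final vector even when their raw counts are high.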

### Use a Pipeline for fit and transform

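Example (your solution may differ). The cells here apply the transformers one at a time instead of through a Pipeline, skip StopWordsRemover, and drop the length column: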

In [6]:
# Encode the label by hand: ham -> 1.0, spam -> 0 (this select also drops the length column)
input_text.createOrReplaceTempView('temp')
input_text = spark.sql('select case class when "ham" then 1.0 else 0 end as label, text from temp')
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="words")
input_words = tokenizer.transform(input_text)
input_words.show()

+-----+--------------------+--------------------+
|label|                text|               words|
+-----+--------------------+--------------------+
|  1.0|Go until jurong p...|[go, until, juron...|
|  1.0|Ok lar... Joking ...|[ok, lar..., joki...|
|  0.0|Free entry in 2 a...|[free, entry, in,...|
|  1.0|U dun say so earl...|[u, dun, say, so,...|
|  1.0|Nah I don't think...|[nah, i, don't, t...|
|  0.0|FreeMsg Hey there...|[freemsg, hey, th...|
|  1.0|Even my brother i...|[even, my, brothe...|
|  1.0|As per your reque...|[as, per, your, r...|
|  0.0|WINNER!! As a val...|[winner!!, as, a,...|
|  0.0|Had your mobile 1...|[had, your, mobil...|
|  1.0|I'm gonna be home...|[i'm, gonna, be, ...|
|  0.0|SIX chances to wi...|[six, chances, to...|
|  0.0|URGENT! You have ...|[urgent!, you, ha...|
|  1.0|I've been searchi...|[i've, been, sear...|
|  1.0|I HAVE A DATE ON ...|[i, have, a, date...|
|  0.0|XXXMobileMovieClu...|[xxxmobilemoviecl...|
|  1.0|Oh k...i'm watchi...|[oh, k...i'm, wat...|


In [7]:
from pyspark.ml.feature import CountVectorizer
# Term counts per message; the output column is named "Features" (capital F)
count = CountVectorizer(inputCol="words", outputCol="Features")
model = count.fit(input_words)
data_features = model.transform(input_words)
data_features.show()

+-----+--------------------+--------------------+--------------------+
|label|                text|               words|            Features|
+-----+--------------------+--------------------+--------------------+
|  1.0|Go until jurong p...|[go, until, juron...|(13587,[8,42,52,6...|
|  1.0|Ok lar... Joking ...|[ok, lar..., joki...|(13587,[5,75,411,...|
|  0.0|Free entry in 2 a...|[free, entry, in,...|(13587,[0,3,8,20,...|
|  1.0|U dun say so earl...|[u, dun, say, so,...|(13587,[5,22,60,1...|
|  1.0|Nah I don't think...|[nah, i, don't, t...|(13587,[0,1,66,87...|
|  0.0|FreeMsg Hey there...|[freemsg, hey, th...|(13587,[0,2,6,10,...|
|  1.0|Even my brother i...|[even, my, brothe...|(13587,[0,7,9,13,...|
|  1.0|As per your reque...|[as, per, your, r...|(13587,[0,10,11,4...|
|  0.0|WINNER!! As a val...|[winner!!, as, a,...|(13587,[0,2,3,14,...|
|  0.0|Had your mobile 1...|[had, your, mobil...|(13587,[0,4,5,10,...|
|  1.0|I'm gonna be home...|[i'm, gonna, be, ...|(13587,[0,1,6,32,...|
|  0.0

In [8]:
from pyspark.ml.feature import IDF
# Column names are case-insensitive by default, so "features" replaces the "Features" count column
idf = IDF(inputCol="Features", outputCol="features")
idfModel = idf.fit(data_features)
clean_data = idfModel.transform(data_features)
clean_data.show()

+-----+--------------------+--------------------+--------------------+
|label|                text|               words|            features|
+-----+--------------------+--------------------+--------------------+
|  1.0|Go until jurong p...|[go, until, juron...|(13587,[8,42,52,6...|
|  1.0|Ok lar... Joking ...|[ok, lar..., joki...|(13587,[5,75,411,...|
|  0.0|Free entry in 2 a...|[free, entry, in,...|(13587,[0,3,8,20,...|
|  1.0|U dun say so earl...|[u, dun, say, so,...|(13587,[5,22,60,1...|
|  1.0|Nah I don't think...|[nah, i, don't, t...|(13587,[0,1,66,87...|
|  0.0|FreeMsg Hey there...|[freemsg, hey, th...|(13587,[0,2,6,10,...|
|  1.0|Even my brother i...|[even, my, brothe...|(13587,[0,7,9,13,...|
|  1.0|As per your reque...|[as, per, your, r...|(13587,[0,10,11,4...|
|  0.0|WINNER!! As a val...|[winner!!, as, a,...|(13587,[0,2,3,14,...|
|  0.0|Had your mobile 1...|[had, your, mobil...|(13587,[0,4,5,10,...|
|  1.0|I'm gonna be home...|[i'm, gonna, be, ...|(13587,[0,1,6,32,...|
|  0.0

### Detect Spam or Ham

Now use your TF-IDF data to classify messages as spam or ham.

Feel free to use any classifier model.

Your results may differ.

In [9]:
# Split the data into training (75%) and test (25%) sets; the fixed seed makes the split repeatable
seed = 0
data_train, data_test = clean_data.randomSplit([0.75, 0.25], seed)
data_train.show()

+-----+--------------------+--------------------+--------------------+
|label|                text|               words|            features|
+-----+--------------------+--------------------+--------------------+
|  0.0|* FREE* POLYPHONI...|[*, free*, polyph...|(13587,[0,4,11,12...|
|  0.0|**FREE MESSAGE**T...|[**free, message*...|(13587,[4,10,20,5...|
|  0.0|+123 Congratulati...|[+123, congratula...|(13587,[0,4,5,8,1...|
|  0.0|+123 Congratulati...|[+123, congratula...|(13587,[0,4,5,8,1...|
|  0.0|+449071512431 URG...|[+449071512431, u...|(13587,[0,4,7,14,...|
|  0.0|-PLS STOP bootyde...|[-pls, stop, boot...|(13587,[0,2,7,24,...|
|  0.0|07732584351 - Rod...|[07732584351, -, ...|(13587,[0,2,3,10,...|
|  0.0|08714712388 betwe...|[08714712388, bet...|(13587,[353,387,8...|
|  0.0|09066362231 URGEN...|[09066362231, urg...|(13587,[0,3,4,7,1...|
|  0.0|0A$NETWORKS allow...|[0a$networks, all...|(13587,[0,3,10,16...|
|  0.0|100 dating servic...|[100, dating, ser...|(13587,[224,665,7...|
|  0.0

In [10]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import numpy as np
log_reg = LogisticRegression(maxIter=10)

# 10 regParam values x 6 elasticNetParam values = 60 candidate settings
paramgrid_log_reg = ParamGridBuilder() \
    .addGrid(log_reg.regParam, np.linspace(0.3, 0.01, 10)) \
    .addGrid(log_reg.elasticNetParam, np.linspace(0.3, 0.80, 6)) \
    .build()
# 5-fold cross-validation, scored with the evaluator's default metric (areaUnderROC)
crossval_log_reg = CrossValidator(estimator=log_reg,
                          estimatorParamMaps=paramgrid_log_reg,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)
cross_val_model_log_reg = crossval_log_reg.fit(data_train)
# Despite its name, best_model_log_reg holds the training summary of the best model
best_model_log_reg = cross_val_model_log_reg.bestModel.summary
best_model_log_reg.predictions.columns

['label',
 'text',
 'words',
 'features',
 'rawPrediction',
 'probability',
 'prediction']

In [11]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Training-set AUC, computed from the raw scores rather than the hard 0/1 predictions
bin_class = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='label', metricName='areaUnderROC')
bin_class.evaluate(best_model_log_reg.predictions)

1.0

In [12]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
mul_class = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')
mul_class.evaluate(best_model_log_reg.predictions)

1.0

In [13]:
mul_class = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
mul_class.evaluate(best_model_log_reg.predictions)

1.0

In [14]:
fit_data_train = best_model_log_reg.predictions.select('label','prediction')
fit_data_train.groupBy('label','prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  0.0|       0.0|  553|
|  1.0|       1.0| 3594|
+-----+----------+-----+



In [15]:
# Score the held-out test set with the best model found by cross-validation
test_result = cross_val_model_log_reg.transform(data_test)
test_result.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|label|                text|               words|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  0.0|(Bank of Granite ...|[(bank, of, grani...|(13587,[3,7,10,12...|[-4.0435855128697...|[0.01723232933579...|       1.0|
|  0.0|2p per min to cal...|[2p, per, min, to...|(13587,[0,10,11,1...|[7.35214183781345...|[0.99935919339179...|       0.0|
|  0.0|3 FREE TAROT TEXT...|[3, free, tarot, ...|(13587,[0,10,11,5...|[3.66606929116433...|[0.97506105109585...|       0.0|
|  0.0|5 Free Top Polyph...|[5, free, top, po...|(13587,[0,3,15,34...|[2.97725530351675...|[0.95153595558103...|       0.0|
|  0.0|500 New Mobiles f...|[500, new, mobile...|(13587,[0,40,64,8...|[3.58582954630600...|[0.97303366766922...|       0.0|
|  0.0|5

In [16]:
test_result.groupBy('label','prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  0.0|       1.0|   41|
|  0.0|       0.0|  153|
|  1.0|       1.0| 1233|
+-----+----------+-----+



### Calculate the accuracy of your model

In [17]:
multi_class_acc = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')
multi_class_acc.evaluate(test_result)

0.9712683952347583

In [18]:
multi_class_f1 = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
multi_class_f1.evaluate(test_result)

0.9698059362766078
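As a sanity check, both numbers can be recomputed by hand from the test-set counts printed in In [16]. The sketch below is plain Python and uses only those three counts. It assumes the 'f1' metric is the weighted F-measure (per-class F1 averaged by true class frequency) and relies on the label encoding from In [6], where 1.0 is ham and 0.0 is spam; the variable and function names are illustrative.

```python
# Counts copied from the groupBy output of In [16]: (label, prediction) -> count.
# Label encoding from In [6]: 1.0 = ham, 0.0 = spam.
counts = {(0.0, 1.0): 41, (0.0, 0.0): 153, (1.0, 1.0): 1233}

labels = sorted({lab for lab, _ in counts} | {pred for _, pred in counts})
total = sum(counts.values())

# Accuracy: the share of messages whose prediction equals the label.
accuracy = sum(n for (lab, pred), n in counts.items() if lab == pred) / total


def f1_for(cls):
    """F1 score for one class, treating that class as the positive one."""
    tp = counts.get((cls, cls), 0)
    fp = sum(n for (lab, pred), n in counts.items() if pred == cls and lab != cls)
    fn = sum(n for (lab, pred), n in counts.items() if lab == cls and pred != cls)
    return 2 * tp / (2 * tp + fp + fn)


# Weighted F1: per-class F1, weighted by each class's share of the true labels.
weighted_f1 = sum(
    f1_for(cls) * sum(n for (lab, _), n in counts.items() if lab == cls) / total
    for cls in labels
)

# Recall on the spam class: how many of the true spam messages were caught.
spam_recall = counts[(0.0, 0.0)] / (counts[(0.0, 0.0)] + counts[(0.0, 1.0)])

print(f"accuracy    = {accuracy:.6f}")
print(f"weighted F1 = {weighted_f1:.6f}")
print(f"spam recall = {spam_recall:.6f}")
```

The first two values reproduce the 0.9713 accuracy and 0.9698 F1 reported by the evaluators. The third shows what the headline accuracy hides: 41 of the 194 spam messages in the test set were classified as ham, so spam recall is only about 0.79.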