# NLP Code Along questions

For this code along we will build a spam filter!

We'll use a classic dataset for this, the SMS Spam Collection from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

#### Load and read the dataset, and have Spark infer the data types

In [16]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp').getOrCreate()

In [17]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer, VectorAssembler
from pyspark.sql.functions import col, udf, length
from pyspark.sql.types import IntegerType

In [18]:
dataset = spark.read.csv("SMSSpamCollection",inferSchema=True,sep='\t')
dataset.show()

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
| ham|I'm gonna be home...|
|spam|SIX chances to wi...|
|spam|URGENT! You have ...|
| ham|I've been searchi...|
| ham|I HAVE A DATE ON ...|
|spam|XXXMobileMovieClu...|
| ham|Oh k...i'm watchi...|
| ham|Eh u remember how...|
| ham|Fine if thats th...|
|spam|England v Macedon...|
+----+--------------------+
only showing top 20 rows



## Clean and Prepare the Data

#### Create a new length feature

In [19]:
dataset = dataset.withColumn('length', length(dataset['_c1']))
dataset.show()

+----+--------------------+------+
| _c0|                 _c1|length|
+----+--------------------+------+
| ham|Go until jurong p...|   111|
| ham|Ok lar... Joking ...|    29|
|spam|Free entry in 2 a...|   155|
| ham|U dun say so earl...|    49|
| ham|Nah I don't think...|    61|
|spam|FreeMsg Hey there...|   147|
| ham|Even my brother i...|    77|
| ham|As per your reque...|   160|
|spam|WINNER!! As a val...|   157|
|spam|Had your mobile 1...|   154|
| ham|I'm gonna be home...|   109|
|spam|SIX chances to wi...|   136|
|spam|URGENT! You have ...|   155|
| ham|I've been searchi...|   196|
| ham|I HAVE A DATE ON ...|    35|
|spam|XXXMobileMovieClu...|   149|
| ham|Oh k...i'm watchi...|    26|
| ham|Eh u remember how...|    81|
| ham|Fine if thats th...|    56|
|spam|England v Macedon...|   155|
+----+--------------------+------+
only showing top 20 rows



#### Print the group-by mean for each class

In [20]:
dataset.groupby('_c0').mean().show()

+----+-----------------+
| _c0|      avg(length)|
+----+-----------------+
| ham|71.45431945307645|
|spam|138.6706827309237|
+----+-----------------+



## Feature Transformations

In this part you transform the raw text into a TF-IDF representation:

- Chain Tokenizer, StopWordsRemover, CountVectorizer and IDF on the text column so that the final output column is named 'tf_idf'.
- Use StringIndexer to convert the class column into an output column named 'label'.
- Use VectorAssembler with 'tf_idf' and 'length' as input columns to create an output column named 'features'.
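
To make the text-processing chain concrete, here is a minimal plain-Python sketch of what the four text stages compute. It is not Spark's implementation: the toy corpus and the tiny stop-word set are hypothetical, and whitespace splitting only approximates Tokenizer. The IDF weight uses the smoothed formula from the Spark ML documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word set, for illustration only.
docs = [
    "Free entry in a prize draw",
    "I am free in the evening",
    "Win a prize now",
]
stop_words = {"i", "am", "in", "a", "the"}

# Tokenizer step: lowercase each message and split it on whitespace.
tokens = [doc.lower().split() for doc in docs]

# StopWordsRemover step: drop tokens that appear in the stop-word set.
filtered = [[t for t in doc if t not in stop_words] for doc in tokens]

# CountVectorizer step: count how often each term occurs in each message.
counts = [Counter(doc) for doc in filtered]

# IDF step: smoothed inverse document frequency, log((m + 1) / (df + 1)).
m = len(docs)
doc_freq = Counter(term for doc in filtered for term in set(doc))
idf = {term: math.log((m + 1) / (n + 1)) for term, n in doc_freq.items()}

# TF-IDF: term count multiplied by the term's IDF weight.
tf_idf = [{term: tf * idf[term] for term, tf in c.items()} for c in counts]

for doc, weights in zip(docs, tf_idf):
    print(doc, "->", {t: round(w, 3) for t, w in weights.items()})
```

Terms that occur in many messages (here "free" and "prize") receive a lower weight than terms that occur in only one message.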

In [21]:
tokenizer = Tokenizer(inputCol="_c1", outputCol="_c1_token")
remover = StopWordsRemover(inputCol="_c1_token", outputCol="_c1_filtered")
cv = CountVectorizer(inputCol="_c1_filtered", outputCol="_c1_cv")
idf = IDF(inputCol="_c1_cv", outputCol="tf_idf")
si = StringIndexer(inputCol='_c0',outputCol='label')
va = VectorAssembler(inputCols=['tf_idf','length'],outputCol='features')

### Use a pipeline to fit and transform

Example (your output may differ):

In [22]:
from pyspark.ml import Pipeline
p_line = Pipeline(stages=[tokenizer, remover, cv, idf, si, va])
p_model = p_line.fit(dataset)
clean_data = p_model.transform(dataset)
clean_data = clean_data.select(['label','features'])
clean_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(13424,[7,11,31,6...|
|  0.0|(13424,[0,24,297,...|
|  1.0|(13424,[2,13,19,3...|
|  0.0|(13424,[0,70,80,1...|
|  0.0|(13424,[36,134,31...|
|  1.0|(13424,[10,60,139...|
|  0.0|(13424,[10,53,103...|
|  0.0|(13424,[125,184,4...|
|  1.0|(13424,[1,47,118,...|
|  1.0|(13424,[0,1,13,27...|
|  0.0|(13424,[18,43,120...|
|  1.0|(13424,[8,17,37,8...|
|  1.0|(13424,[13,30,47,...|
|  0.0|(13424,[39,96,217...|
|  0.0|(13424,[552,1697,...|
|  1.0|(13424,[30,109,11...|
|  0.0|(13424,[82,214,47...|
|  0.0|(13424,[0,2,49,13...|
|  0.0|(13424,[0,74,105,...|
|  1.0|(13424,[4,30,33,5...|
+-----+--------------------+
only showing top 20 rows
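
The 0.0 and 1.0 values in the label column come from StringIndexer, which by default assigns indices in order of descending label frequency, so the more common class (ham) becomes 0.0. A small plain-Python sketch of that ordering, using the classes of the first six rows shown earlier (this mimics the default behaviour and is not Spark code):

```python
from collections import Counter

# Classes of the first six rows of the dataset shown earlier.
labels = ["ham", "ham", "spam", "ham", "ham", "spam"]

# Order labels from most to least frequent, then number them from 0.0.
ordered = [label for label, _ in Counter(labels).most_common()]
index = {label: float(i) for i, label in enumerate(ordered)}

encoded = [index[label] for label in labels]
print(index)
print(encoded)
```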



### Detect spam or ham

Now use your TF-IDF features to classify each message as spam or ham.

Feel free to use any classifier.

Your results may differ.

In [23]:
(train, test) = clean_data.randomSplit([0.8,0.2])
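
One reason results differ between runs is that the split is random. `randomSplit` accepts an optional `seed` argument (for example `clean_data.randomSplit([0.8, 0.2], seed=42)`) to make it repeatable. The plain-Python sketch below illustrates the idea under a simplifying assumption: each row is assigned independently with the given probability, so the split sizes are only approximately 80/20. The helper name `random_split` is hypothetical.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently, using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)

# The same seed gives the same split; the sizes are close to 800 and 200, not exact.
print(len(train_a), len(test_a), train_a == train_b)
```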

In [24]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
lr_classifier = lr.fit(train)
pred_lr = lr_classifier.transform(test)

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
nb_classifier = nb.fit(train)
pred_nb = nb_classifier.transform(test)

In [25]:
pred_lr.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,2,13,...|[19.2180739433199...|[0.99999999549498...|       0.0|
|  0.0|(13424,[0,1,12,33...|[28.4420722561174...|[0.99999999999955...|       0.0|
|  0.0|(13424,[0,1,14,78...|[30.6844364015617...|[0.99999999999995...|       0.0|
|  0.0|(13424,[0,1,146,1...|[18.2972612834777...|[0.99999998868641...|       0.0|
|  0.0|(13424,[0,1,498,5...|[18.2628798756859...|[0.99999998829067...|       0.0|
|  0.0|(13424,[0,1,874,1...|[13.0310820280527...|[0.99999780885039...|       0.0|
|  0.0|(13424,[0,1,874,1...|[13.1081856910984...|[0.99999797144678...|       0.0|
|  0.0|(13424,[0,2,3,5,6...|[56.1195717396599...|[1.0,4.2420879335...|       0.0|
|  0.0|(13424,[0,2,3,5,3...|[19.6874678011734...|[0.99999999718264...|       0.0|
|  0.0|(13424,[0
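
For binary logistic regression, the probability column is the logistic (sigmoid) function applied to the rawPrediction column. The sketch below checks this against the first row above, using the displayed (truncated) raw value, so the result should agree with the table only to the digits shown.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# First rawPrediction entry of the first row above (truncated in the display).
raw_ham = 19.2180739433199

p_ham = sigmoid(raw_ham)
p_spam = 1.0 - p_ham

# With the default threshold of 0.5, the more probable class is predicted.
prediction = 0.0 if p_ham >= 0.5 else 1.0
print(p_ham, p_spam, prediction)
```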

In [26]:
pred_nb.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,2,13,...|[-610.79815822679...|[1.0,2.3686196747...|       0.0|
|  0.0|(13424,[0,1,12,33...|[-450.65668202381...|[1.0,8.3598469825...|       0.0|
|  0.0|(13424,[0,1,14,78...|[-689.69780596248...|[1.0,5.6344961383...|       0.0|
|  0.0|(13424,[0,1,146,1...|[-252.84380660161...|[0.89357363744565...|       0.0|
|  0.0|(13424,[0,1,498,5...|[-321.30704753453...|[0.99999999999645...|       0.0|
|  0.0|(13424,[0,1,874,1...|[-95.878781989564...|[0.99999998533590...|       0.0|
|  0.0|(13424,[0,1,874,1...|[-97.551562867474...|[0.99999998847918...|       0.0|
|  0.0|(13424,[0,2,3,5,6...|[-2570.6773492518...|[1.0,2.8143697323...|       0.0|
|  0.0|(13424,[0,2,3,5,3...|[-489.24566631553...|[1.0,4.8787732690...|       0.0|
|  0.0|(13424,[0

### Calculate the accuracy of your model

In [33]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn.metrics import classification_report, confusion_matrix

# The default metricName is "f1" (weighted F1), so these scores are F1, not accuracy.
# Pass metricName="accuracy" to the evaluator to get accuracy instead.
mce = MulticlassClassificationEvaluator()

f1_lr = mce.evaluate(pred_lr)
print("Weighted F1 of Logistic regression: {}".format(f1_lr))

f1_nb = mce.evaluate(pred_nb)
print("Weighted F1 of Naive Bayes: {}".format(f1_nb))

# Convert the collected Row objects to plain floats for scikit-learn.
y_true = [row['label'] for row in test.select('label').collect()]
y_pred_lr = [row['prediction'] for row in pred_lr.select('prediction').collect()]
y_pred_nb = [row['prediction'] for row in pred_nb.select('prediction').collect()]

print('\n', classification_report(y_true, y_pred_lr))
print(classification_report(y_true, y_pred_nb))

Weighted F1 of Logistic regression: 0.9657210192437184
Weighted F1 of Naive Bayes: 0.9213967408750166

               precision    recall  f1-score   support

         0.0       0.97      1.00      0.98       969
         1.0       0.99      0.76      0.86       143

    accuracy                           0.97      1112
   macro avg       0.98      0.88      0.92      1112
weighted avg       0.97      0.97      0.97      1112

              precision    recall  f1-score   support

         0.0       1.00      0.90      0.95       969
         1.0       0.60      0.99      0.75       143

    accuracy                           0.91      1112
   macro avg       0.80      0.95      0.85      1112
weighted avg       0.95      0.91      0.92      1112
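
The f1-score column in the reports above is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes the spam-class (1.0) F1 for both models from the rounded precision and recall values printed in the reports, so it can only be expected to agree to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded spam-class values taken from the two reports above.
f1_lr_spam = f1_score(0.99, 0.76)  # logistic regression
f1_nb_spam = f1_score(0.60, 0.99)  # naive Bayes

print(round(f1_lr_spam, 2), round(f1_nb_spam, 2))
```

The two models trade off differently: logistic regression rarely flags ham as spam but misses about a quarter of the spam, while naive Bayes catches almost all spam at the cost of more false alarms.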

