**TEXT MESSAGE SPAM DETECTION USING A JOHN SNOW LABS SPARK NLP MODEL**

Setting up Colab 

In [1]:
# Install java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!java -version

# Install pyspark
!pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP (pinned to 2.7.3, the version pip resolved in the output below)
!pip install --ignore-installed spark-nlp==2.7.3

openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/32/9e/2f43d668eefea486e7417c1e83554c72a41e0786976e9429846b753f5014/spark_nlp-2.7.3-py2.py3-none-any.whl (138kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.7.3


In [3]:
import pandas as pd
import numpy as np
import os
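# Point Spark at the Java 8 JDK installed above; the `java -version` output shows Java 11 as the system default
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]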
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

Start Spark Session 

In [49]:
spark = sparknlp.start()

Select the DL model

In [50]:
### Select Model
model_name = 'classifierdl_use_spam'

In [51]:
# text_list=[
# """Hiya do u like the hlday pics looked horrible in them so took mo out! Hows the camp Amrca thing? Speak soon Serena:)""",
# """U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094594""",]


# import data
df_pd = pd.read_csv('spamraw_train.csv')
df = spark.createDataFrame(df_pd)

In [52]:
df.show()

+---+--------------------+----+
| id|            sms_text|spam|
+---+--------------------+----+
|  1|Hope you are havi...|   0|
|  2|K..give back my t...|   0|
|  3|Am also doing in ...|   0|
|  4|complimentary 4 S...|   1|
|  5|okmail: Dear Dave...|   1|
|  6|Aiya we discuss l...|   0|
|  7|Are you this much...|   0|
|  8|Please ask mummy ...|   0|
|  9|Marvel Mobile Pla...|   1|
| 10|fyi I'm at usf no...|   0|
| 11|Sure thing big ma...|   0|
| 12|   I anything lor...|   0|
| 13|By march ending, ...|   0|
| 14|Hmm well, night n...|   0|
| 15|K I'll be sure to...|   0|
| 16|Ha ha cool cool c...|   0|
| 17|Darren was saying...|   0|
| 18|He dint tell anyt...|   0|
| 19|Up to u... u wan ...|   0|
| 20|U can WIN £100 of...|   1|
+---+--------------------+----+
only showing top 20 rows



In [53]:
# Split the data 80/20 with a fixed seed

train, test = df.randomSplit([0.80, 0.20], seed=10)

In [55]:
test.count()

941

Define Spark NLP Pipeline

In [56]:
### Select Model
model_name = 'classifierdl_use_spam'

# Wrap the raw SMS text column in a Spark NLP document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("sms_text")\
    .setOutputCol("document")

# Embed each document with the pretrained Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Classify each embedded document as ham or spam with the pretrained model
document_classifier = ClassifierDLModel.pretrained(model_name)\
    .setInputCols(['document', 'sentence_embeddings'])\
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    use,
    document_classifier
])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
classifierdl_use_spam download started this may take some time.
Approximate size to download 21.3 MB
[OK!]


Run the Pipeline

In [57]:
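# empty_df = spark.createDataFrame([['']]).toDF("text")
# Every stage is pretrained or stateless, so fit() trains nothing; it only builds the PipelineModel
pipelineModel = nlpPipeline.fit(train)
# df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
# Run inference on the train split
result = pipelineModel.transform(train)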

Visualize results

In [58]:
# Flatten the annotation structs into plain columns: the document text and its predicted label
res = result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
            .select(F.expr("cols['0']").alias("document"),
                    F.expr("cols['1']").alias("class"))

In [59]:
res.count()

4059

In [60]:
res.show()

+--------------------+-----+
|            document|class|
+--------------------+-----+
|Hope you are havi...|  ham|
|K..give back my t...|  ham|
|complimentary 4 S...| spam|
|okmail: Dear Dave...| spam|
|Aiya we discuss l...|  ham|
|Are you this much...|  ham|
|Please ask mummy ...|  ham|
|fyi I'm at usf no...|  ham|
|Sure thing big ma...|  ham|
|   I anything lor...|  ham|
|By march ending, ...|  ham|
|Hmm well, night n...|  ham|
|K I'll be sure to...|  ham|
|Ha ha cool cool c...|  ham|
|Darren was saying...|  ham|
|He dint tell anyt...|  ham|
|Up to u... u wan ...|  ham|
|U can WIN £100 of...| spam|
|2mro i am not com...|  ham|
|ARR birthday toda...|  ham|
+--------------------+-----+
only showing top 20 rows



In [61]:
res.printSchema()

root
 |-- document: string (nullable = true)
 |-- class: string (nullable = true)



In [17]:
from google.colab import files
# pd_res = res.toPandas()
# pd_res.to_csv('/content/drive/MyDrive/BigData/res.csv')

In [62]:
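from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Number the rows 1..N. Ordering by a constant gives no guaranteed row order
# and moves all rows into a single partition, so this is only safe for small data.
w = Window().orderBy(lit('A'))
res = res.withColumn("id", row_number().over(w))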

In [63]:
res.show()

+--------------------+-----+---+
|            document|class| id|
+--------------------+-----+---+
|Hope you are havi...|  ham|  1|
|K..give back my t...|  ham|  2|
|complimentary 4 S...| spam|  3|
|okmail: Dear Dave...| spam|  4|
|Aiya we discuss l...|  ham|  5|
|Are you this much...|  ham|  6|
|Please ask mummy ...|  ham|  7|
|fyi I'm at usf no...|  ham|  8|
|Sure thing big ma...|  ham|  9|
|   I anything lor...|  ham| 10|
|By march ending, ...|  ham| 11|
|Hmm well, night n...|  ham| 12|
|K I'll be sure to...|  ham| 13|
|Ha ha cool cool c...|  ham| 14|
|Darren was saying...|  ham| 15|
|He dint tell anyt...|  ham| 16|
|Up to u... u wan ...|  ham| 17|
|U can WIN £100 of...| spam| 18|
|2mro i am not com...|  ham| 19|
|ARR birthday toda...|  ham| 20|
+--------------------+-----+---+
only showing top 20 rows



In [64]:
# convert res to pandas
pd_restrain = res.toPandas()

In [65]:
display(pd_restrain)

Unnamed: 0,document,class,id
0,Hope you are having a good week. Just checking in,ham,1
1,K..give back my thanks.,ham,2
2,"complimentary 4 STAR Ibiza Holiday or £10,000 ...",spam,3
3,okmail: Dear Dave this is your final notice to...,spam,4
4,Aiya we discuss later lar... Pick u up at 4 is...,ham,5
...,...,...,...
4054,Is there any movie theatre i can go to and wat...,ham,4055
4055,Customer service announcement. We recently tri...,spam,4056
4056,Aiyar dun disturb u liao... Thk u have lots 2 ...,ham,4057
4057,"SMS SERVICES. for your inclusive text credits,...",spam,4058


In [66]:
pd_train = train.toPandas()
display(pd_train)

Unnamed: 0,id,sms_text,spam
0,1,Hope you are having a good week. Just checking in,0
1,2,K..give back my thanks.,0
2,4,"complimentary 4 STAR Ibiza Holiday or £10,000 ...",1
3,5,okmail: Dear Dave this is your final notice to...,1
4,6,Aiya we discuss later lar... Pick u up at 4 is...,0
...,...,...,...
4054,4994,Is there any movie theatre i can go to and wat...,0
4055,4995,Customer service announcement. We recently tri...,1
4056,4996,Aiyar dun disturb u liao... Thk u have lots 2 ...,0
4057,4997,"SMS SERVICES. for your inclusive text credits,...",1


In [67]:
train_pd = pd.concat([pd_train,pd_restrain['class']],axis=1)
#train_pd.drop(["document"],inplace=True,axis=1)
display(train_pd)

Unnamed: 0,id,sms_text,spam,class
0,1,Hope you are having a good week. Just checking in,0,ham
1,2,K..give back my thanks.,0,ham
2,4,"complimentary 4 STAR Ibiza Holiday or £10,000 ...",1,spam
3,5,okmail: Dear Dave this is your final notice to...,1,spam
4,6,Aiya we discuss later lar... Pick u up at 4 is...,0,ham
...,...,...,...,...
4054,4994,Is there any movie theatre i can go to and wat...,0,ham
4055,4995,Customer service announcement. We recently tri...,1,spam
4056,4996,Aiyar dun disturb u liao... Thk u have lots 2 ...,0,ham
4057,4997,"SMS SERVICES. for your inclusive text credits,...",1,spam


In [68]:
train_pd['class']=train_pd['class'].map(dict(ham=0, spam=1))

In [69]:
display(train_pd)

Unnamed: 0,id,sms_text,spam,class
0,1,Hope you are having a good week. Just checking in,0,0
1,2,K..give back my thanks.,0,0
2,4,"complimentary 4 STAR Ibiza Holiday or £10,000 ...",1,1
3,5,okmail: Dear Dave this is your final notice to...,1,1
4,6,Aiya we discuss later lar... Pick u up at 4 is...,0,0
...,...,...,...,...
4054,4994,Is there any movie theatre i can go to and wat...,0,0
4055,4995,Customer service announcement. We recently tri...,1,1
4056,4996,Aiyar dun disturb u liao... Thk u have lots 2 ...,0,0
4057,4997,"SMS SERVICES. for your inclusive text credits,...",1,1
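
The cells above attach predictions to the labelled rows by position (`pd.concat(..., axis=1)`), which is only correct while both frames keep exactly the same row order. A minimal sketch of a key-based alternative follows. It assumes the prediction frame still carries the original `id`, which the notebook's `select` does not currently keep, and the two small frames are hypothetical stand-ins rather than the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's frames (values are illustrative only).
labels = pd.DataFrame({
    "id": [1, 2, 4],
    "sms_text": ["first message", "second message", "third message"],
    "spam": [0, 0, 1],
})

# Assumption: predictions kept the original id. Rows are deliberately out of order.
preds = pd.DataFrame({
    "id": [4, 1, 2],
    "class": ["spam", "ham", "ham"],
})

# Join on the key instead of relying on row position.
# validate="one_to_one" raises if an id is duplicated on either side.
merged = labels.merge(preds, on="id", how="left", validate="one_to_one")

# Map the string labels to the 0/1 encoding used by the `spam` column.
label_map = {"ham": 0, "spam": 1}
merged["class"] = merged["class"].map(label_map)

# .map() leaves NaN for unknown labels or ids without a prediction; fail loudly.
if merged["class"].isna().any():
    raise ValueError("some rows have no usable prediction")
merged["class"] = merged["class"].astype(int)

print(merged)
```

With a join like this, a reordering of rows inside Spark would no longer silently pair a message with another message's prediction.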


Test Split Data

In [70]:
result_test = pipelineModel.transform(test) 

In [71]:
res_test = result_test.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
            .select(F.expr("cols['0']").alias("document"),
            F.expr("cols['1']").alias("class"))

In [72]:
pd_restest = res_test.toPandas()
pd_test = test.toPandas()

test_pd = pd.concat([pd_test,pd_restest['class']],axis=1)
test_pd['class']=test_pd['class'].map(dict(ham=0, spam=1))
display(test_pd)

Unnamed: 0,id,sms_text,spam,class
0,3,Am also doing in cbe only. But have to pay.,0,0
1,9,Marvel Mobile Play the official Ultimate Spide...,1,1
2,27,Well if I'm that desperate I'll just call arma...,0,0
3,28,"Fuuuuck I need to stop sleepin, sup",0,0
4,31,Ok.ok ok..then..whats ur todays plan,0,0
...,...,...,...,...
936,4975,Ranjith cal drpd Deeraj and deepak 5min hold,0,0
937,4980,LookAtMe!: Thanks for your purchase of a video...,1,1
938,4993,Todays Voda numbers ending with 7634 are selec...,1,1
939,4998,If you're not in my car in an hour and a half ...,0,0


In [75]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, f1_score

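# Evaluate on the train split; use test_pd instead to score the held-out split
y_val = train_pd['spam'].to_numpy()      # ground-truth labels (0 = ham, 1 = spam)
pred_val = train_pd['class'].to_numpy()  # model predictions mapped to 0/1

print("Confusion matrix:")
print(confusion_matrix(y_val, pred_val))

# With average="micro", F1 equals overall accuracy for single-label classification
print("\nF1 Score = {:.5f}".format(f1_score(y_val, pred_val, average="micro")))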

print("\nClassification Report:")
print(classification_report(y_val, pred_val))

Confusion matrix:
[[3480   28]
 [  30  521]]

F1 Score = 0.98571

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3508
           1       0.95      0.95      0.95       551

    accuracy                           0.99      4059
   macro avg       0.97      0.97      0.97      4059
weighted avg       0.99      0.99      0.99      4059
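
The reported "F1 Score" uses `average="micro"`, which for a single-label problem is the same number as accuracy, so it says little about the minority spam class. The sketch below recomputes the figures by hand from the confusion matrix printed above, assuming scikit-learn's layout (rows are true labels, columns are predictions, spam = 1 is the positive class). The variable names are illustrative.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]] with spam as positive.
tn, fp = 3480, 28
fn, tp = 30, 521

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Micro-averaging pools counts over both classes. Every correct prediction is a
# true positive for its own class, and every error is a false positive for one
# class and a false negative for the other.
micro_tp = tp + tn
micro_fp = fp + fn
micro_fn = fn + fp
micro_f1 = 2 * micro_tp / (2 * micro_tp + micro_fp + micro_fn)

# Per-class metrics for spam, the class of interest.
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy       = {accuracy:.5f}")
print(f"micro F1       = {micro_f1:.5f}")
print(f"spam precision = {spam_precision:.5f}")
print(f"spam recall    = {spam_recall:.5f}")
print(f"spam F1        = {spam_f1:.5f}")
```

The spam-class F1 (about 0.95 in the classification report) is the more informative number here, since ham makes up 3508 of the 4059 rows.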



In [73]:
test_kag = pd.read_csv('spamraw_test.csv')
df_test = spark.createDataFrame(test_kag)

In [74]:
df_test.show()

+-----+--------------------+
|   id|            sms_text|
+-----+--------------------+
|12000|HOT LIVE FANTASIE...|
|12001|I not at home now...|
|12002|So how's scotland...|
|12003|Yo you around? A ...|
|12004|I'm aight. Wat's ...|
|12005|SMS. ac Sptv: The...|
|12006|I'm taking derek ...|
|12007|Okay name ur pric...|
|12008|Kallis is ready f...|
|12009|I'll get there at...|
|12010|Shall i come to g...|
|12011|Carry on not dist...|
|12012|Hi dis is yijue i...|
|12013|I need... Coz i n...|
|12014|SMS. ac Blind Dat...|
|12015|Was gr8 to see th...|
|12016|I have no idea wh...|
|12017|Hello which the s...|
|12018|U studying in sch...|
|12019|Thank you. I like...|
+-----+--------------------+
only showing top 20 rows



In [79]:
result_kaggle = pipelineModel.transform(df_test) 

In [80]:
result_kaggle.show()

+-----+--------------------+--------------------+--------------------+--------------------+
|   id|            sms_text|            document| sentence_embeddings|               class|
+-----+--------------------+--------------------+--------------------+--------------------+
|12000|HOT LIVE FANTASIE...|[[document, 0, 10...|[[sentence_embedd...|[[category, 0, 10...|
|12001|I not at home now...|[[document, 0, 23...|[[sentence_embedd...|[[category, 0, 23...|
|12002|So how's scotland...|[[document, 0, 94...|[[sentence_embedd...|[[category, 0, 94...|
|12003|Yo you around? A ...|[[document, 0, 64...|[[sentence_embedd...|[[category, 0, 64...|
|12004|I'm aight. Wat's ...|[[document, 0, 39...|[[sentence_embedd...|[[category, 0, 39...|
|12005|SMS. ac Sptv: The...|[[document, 0, 11...|[[sentence_embedd...|[[category, 0, 11...|
|12006|I'm taking derek ...|[[document, 0, 14...|[[sentence_embedd...|[[category, 0, 14...|
|12007|Okay name ur pric...|[[document, 0, 80...|[[sentence_embedd...|[[category

In [81]:
res_kaggle = result_kaggle.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
            .select(F.expr("cols['0']").alias("document"),
            F.expr("cols['1']").alias("class"))

In [82]:
res_kaggle.show()

+--------------------+-----+
|            document|class|
+--------------------+-----+
|HOT LIVE FANTASIE...| spam|
|I not at home now...|  ham|
|So how's scotland...|  ham|
|Yo you around? A ...|  ham|
|I'm aight. Wat's ...|  ham|
|SMS. ac Sptv: The...| spam|
|I'm taking derek ...|  ham|
|Okay name ur pric...|  ham|
|Kallis is ready f...|  ham|
|I'll get there at...|  ham|
|Shall i come to g...|  ham|
|Carry on not dist...|  ham|
|Hi dis is yijue i...|  ham|
|I need... Coz i n...|  ham|
|SMS. ac Blind Dat...| spam|
|Was gr8 to see th...|  ham|
|I have no idea wh...|  ham|
|Hello which the s...|  ham|
|U studying in sch...|  ham|
|Thank you. I like...|  ham|
+--------------------+-----+
only showing top 20 rows



In [83]:
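from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Rebuild the ids by position, starting at 12000 as in df_test.
# This assumes the row order still matches df_test; ordering by a constant does not guarantee it.
w = Window().orderBy(lit('A'))
res_kaggle = res_kaggle.withColumn("id", 12000 + row_number().over(w) - 1)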

In [84]:
res_kaggle.show()

+--------------------+-----+-----+
|            document|class|   id|
+--------------------+-----+-----+
|HOT LIVE FANTASIE...| spam|12000|
|I not at home now...|  ham|12001|
|So how's scotland...|  ham|12002|
|Yo you around? A ...|  ham|12003|
|I'm aight. Wat's ...|  ham|12004|
|SMS. ac Sptv: The...| spam|12005|
|I'm taking derek ...|  ham|12006|
|Okay name ur pric...|  ham|12007|
|Kallis is ready f...|  ham|12008|
|I'll get there at...|  ham|12009|
|Shall i come to g...|  ham|12010|
|Carry on not dist...|  ham|12011|
|Hi dis is yijue i...|  ham|12012|
|I need... Coz i n...|  ham|12013|
|SMS. ac Blind Dat...| spam|12014|
|Was gr8 to see th...|  ham|12015|
|I have no idea wh...|  ham|12016|
|Hello which the s...|  ham|12017|
|U studying in sch...|  ham|12018|
|Thank you. I like...|  ham|12019|
+--------------------+-----+-----+
only showing top 20 rows



In [85]:
pd_res_kaggle = res_kaggle.toPandas()
display(pd_res_kaggle)

Unnamed: 0,document,class,id
0,HOT LIVE FANTASIES call now 08707509020 Just 2...,spam,12000
1,I not at home now lei...,ham,12001
2,So how's scotland. Hope you are not over showi...,ham,12002
3,Yo you around? A friend of mine's lookin to pi...,ham,12003
4,I'm aight. Wat's happening on your side.,ham,12004
...,...,...,...
554,You are a great role model. You are giving so ...,ham,12554
555,"Awesome, I remember the last time we got someb...",ham,12555
556,"If you don't, your prize will go to another cu...",spam,12556
557,"SMS. ac JSco: Energy is high, but u may not kn...",ham,12557


In [86]:
pd_res_kaggle['class']=pd_res_kaggle['class'].map(dict(ham=0, spam=1))
display(pd_res_kaggle)

Unnamed: 0,document,class,id
0,HOT LIVE FANTASIES call now 08707509020 Just 2...,1,12000
1,I not at home now lei...,0,12001
2,So how's scotland. Hope you are not over showi...,0,12002
3,Yo you around? A friend of mine's lookin to pi...,0,12003
4,I'm aight. Wat's happening on your side.,0,12004
...,...,...,...
554,You are a great role model. You are giving so ...,0,12554
555,"Awesome, I remember the last time we got someb...",0,12555
556,"If you don't, your prize will go to another cu...",1,12556
557,"SMS. ac JSco: Energy is high, but u may not kn...",0,12557


In [87]:
pd_res_kaggle.rename(columns={'class':"predicted"},inplace=True)

In [88]:

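my_submission = pd.DataFrame({'id': pd_res_kaggle.id, 'predicted': pd_res_kaggle.predicted})
# index=False keeps the pandas row index out of the file, leaving only id and predicted
my_submission.to_csv('test.csv', index=False)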