<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Spam Classification with </b> <span style="font-weight:bold; color:green">Spark Streaming</span></div><hr>
<div style="text-align:right;">S. Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<p>Do not run the code below in this Jupyter notebook. Combine all the code snippets into a single file and run it from the terminal.</p>

<p><b>1. Classification model</b></p>

<div class="msg-block msg-warning">
  <p class="msg-text-warn">This model is for demonstration purposes only: important steps such as preprocessing the input messages, model selection, and testing are omitted.</p>
</div>

In [None]:
import pandas as pnd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


# ====================================
#
# INPUT DATA FOR TRAINING
#
# ====================================

# Dataframe columns
columns = ["class", "message"]

# Converter: class names to numbers (ham -> 0, spam -> 1)
def convert2bool(el):
    return 0 if el == "ham" else 1

# Converters applied to dataframe columns while the file is read
converters = {"class": convert2bool}

# Create the pandas dataframe from the tab-separated file
df = pnd.read_csv("/FULL_PATH/data/SMSSpamCollection",
                  sep="\t",
                  converters=converters,
                  names=columns)


# ====================================
#
# MODEL
#
# ====================================

ngram = 1

# TF-IDF model
tf_idf_model = TfidfVectorizer(min_df=1, ngram_range=(1,ngram))

# Converting input data (message) to tf-idf
tf_idf = tf_idf_model.fit_transform(df["message"]) 

# Init linear model for classification
lr_model = LogisticRegression(penalty="l2", 
                              fit_intercept=True, 
                              max_iter=100, 
                              C=1400,
                              solver="lbfgs", 
                              random_state=12345)

# Training the model
lr_model.fit(tf_idf, df["class"])
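
<p>Before wiring the model into Spark, it can help to check that the vectorizer and the classifier work together on a single message. The sketch below is only an illustration under stated assumptions: the four-message corpus is made up (it is not taken from SMSSpamCollection), and <b>classify</b> is a hypothetical helper that mirrors the transform-then-predict steps used later in the streaming app.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus invented for illustration only (assumption, not real data)
messages = [
    "free prize, claim your free phone now",
    "win cash now, free entry",
    "are we still meeting for lunch today",
    "please send me the documents tomorrow",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham, same encoding as convert2bool

# Same kind of models as in the notebook, fitted on the toy corpus
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
features = vectorizer.fit_transform(messages)

classifier = LogisticRegression(C=1400, solver="lbfgs", max_iter=100,
                                random_state=12345)
classifier.fit(features, labels)


def classify(message):
    """Hypothetical helper: vectorize one message and return its class name."""
    # transform() expects an iterable of documents, hence the one-element list
    vector = vectorizer.transform([message])
    return "spam" if classifier.predict(vector)[0] else "ham"


print(classify("free phone"))
```

<p>With the real data, the same check would use tf_idf_model and lr_model from the cell above instead of the toy models.</p>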

<p><b>2. Spark Streaming App for classifying input messages</b></p>

In [None]:
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext


# Create Spark Context
sc = SparkContext(appName="SpamClassification")

# Set log level
#sc.setLogLevel("INFO")

# Broadcast the fitted models to all worker nodes
tf_idf_model_broadcast = sc.broadcast(tf_idf_model)
lr_model_broadcast = sc.broadcast(lr_model)

# Create Streaming Context with a 10-second batch interval
ssc = StreamingContext(sc, 10)

# Create a DStream that reads text lines from a TCP socket
lines = ssc.socketTextStream("localhost", 9999)

# Create a DStream of (message, TF-IDF vector) pairs
tf_idf_messages = lines.map(lambda row: (row, tf_idf_model_broadcast.value.transform([row])))

def predict_message_class(message_tf_idf):
    # Classify the message: the model outputs 1 for spam and 0 for ham
    pred = lr_model_broadcast.value.predict(message_tf_idf)
    return "spam" if pred[0] else "ham"

# Create a DStream of (message, predicted class) pairs
predictions = tf_idf_messages.map(lambda row: (row[0], predict_message_class(row[1])))

# Print the first 10 records of each batch in the terminal
predictions.pprint()

# Uncomment the line below to save the results to local files
#predictions.transform(lambda rdd: rdd.coalesce(1)).saveAsTextFiles("FILE_PATH")

# Start Spark Streaming
ssc.start()

# Await termination
ssc.awaitTermination()
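
<p>The app calls transform and predict once per message. A possible variation, sketched here as an assumption rather than as part of the original app, is to classify a whole partition with one call to each model. <b>classify_partition</b> is a hypothetical helper; it only relies on the vectorizer having a transform method and the classifier having a predict method, as the scikit-learn models above do.</p>

```python
def classify_partition(rows, vectorizer, classifier):
    """Hypothetical helper: classify all messages of one partition at once.

    rows       -- iterator over the messages (strings) of the partition
    vectorizer -- fitted object with transform(list_of_messages)
    classifier -- fitted object with predict(features), returning 1 for spam
    """
    messages = list(rows)
    if not messages:
        # Empty partitions are common in streaming; skip the model calls
        return iter(())
    features = vectorizer.transform(messages)   # one call for the partition
    labels = classifier.predict(features)       # one call for the partition
    return (
        (message, "spam" if label else "ham")
        for message, label in zip(messages, labels)
    )


# Possible wiring inside the streaming app (assumes the names defined there):
# predictions = lines.mapPartitions(
#     lambda rows: classify_partition(rows,
#                                     tf_idf_model_broadcast.value,
#                                     lr_model_broadcast.value))
```

<p>Whether this is faster in practice depends on the batch sizes; for a few messages typed by hand the per-message version is just as good.</p>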

<p><b>3. Running Spark Streaming app in terminal</b></p>

<p>Combine the code above into a single file and run the command below to start the Spark Streaming application.</p>

In [None]:
spark-submit --master local[2] /YOUR_PATH/spark_streaming_spam_classification.py

<p>Every 10 seconds the terminal should show the minibatch timestamp followed by the list of classified messages. For now the list is empty.</p>

In [None]:
-------------------------------------------
Time: 2018-11-15 02:58:11
-------------------------------------------


<p><b>4. Testing App in terminal</b></p>

<p>Open a new terminal window and use the netcat tool to create a listener on port 9999.</p>

In [None]:
nc -lk 9999

<p>Enter some messages in that terminal, such as "free smartphone", and press Enter after each one. The result appears in the Spark app terminal. You will probably see something like the following.</p>

In [None]:
-------------------------------------------
Time: 2018-11-15 03:00:20
-------------------------------------------
('free smartphone ', 'spam')
('send me documents', 'ham')
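
<p>If netcat is not available, a small Python script can play the same role: it listens on the port, waits for the Spark app to connect, and sends newline-terminated messages. This is a minimal sketch; <b>open_listener</b> and <b>send_messages</b> are hypothetical helper names, and the script serves a single connection and then stops, unlike netcat with the -k flag.</p>

```python
import socket


def open_listener(host="localhost", port=9999):
    """Hypothetical helper: open a TCP listener for the Spark app to connect to."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts of the script on the same port
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def send_messages(srv, messages):
    """Hypothetical helper: wait for one client and send each message as a line."""
    conn, _ = srv.accept()  # blocks until the Spark app connects
    with conn:
        for message in messages:
            # The socket text stream splits its input on newlines
            conn.sendall((message + "\n").encode("utf-8"))
```

<p>For example, calling send_messages(open_listener(), ["free smartphone", "send me documents"]) while the Spark app is running should deliver both messages to the stream.</p>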
