# Assignment 3: streaming analytics on text data

## Spark setting and data import

The first step is to load the saved stories that were collected via Spark. For performance, the data is held in a Spark DataFrame rather than an RDD or a pandas DataFrame. A PySpark environment is set up first, and the saved stories are then read in.

In [None]:
from pyspark.sql.functions import col







In [14]:
sc

In [15]:
spark

In [131]:
# Read through all the subdirectories saved
df = spark.read.json("C:/Users/Admin/Advanced Analytics for Bid Data World/Assignment 3/saved_stories/*")

# Show top 20 rows to observe whether it is correctly formatted and count
df.show()
df.count()

+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------+-----+
|     aid|comments|              domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|          user|votes|
+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------+-----+
|39988880|       0|themeasureofaplan...|    false|2024-04-10 09:57:10|Moonshine Money: ...|Moonshine Money: ...|Moonshine Money: ...|https://themeasur...|getToTheChopin|    3|
|39988889|       0|    scitechdaily.com|    false|2024-04-10 09:58:28|New Jurassic Foss...|New Jurassic Foss...|New Jurassic Foss...|https://scitechda...|    isaacfrond|    1|
|39988912|       0|           proton.me|    false|2024-04-10 10:01:14|Proton and Standa...|Proton and Standa...|Proton a

1346

## Preprocessing

Transformations are applied to the Spark DataFrame to prepare it for model training.

### Removing duplicates

In [132]:
# Remove duplicate rows and count
df = df.dropDuplicates()
df.count()

792

In [133]:
from pyspark.sql.functions import when

# Encode the boolean label column 'frontpage' as 1/0
df = df.withColumn('frontpage', when(df.frontpage==True, 1).otherwise(0))

In [134]:
df.show()

+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------+-----+
|     aid|comments|              domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|          user|votes|
+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------+-----+
|39988706|       0|       wikipedia.org|        1|2024-04-10 09:29:10|Chess therapy - W...|       Chess therapy|       Chess Therapy|https://en.wikipe...|getToTheChopin|    4|
|39998952|       0|          bps.org.uk|        0|2024-04-11 06:22:26|Will the debate a...|Will the debate a...|Will the debate a...|https://www.bps.o...|          wjb3|    1|
|39989330|       0|           quora.com|        0|2024-04-10 11:05:12|How to remember t...|How do you rememb...|How to r

In [137]:
# Keep the necessary columns ('comments', 'frontpage', 'source_text', 'votes') and show the transformed dataframe as a check

df = df.select(col('comments'), col('frontpage'), col('source_text'), col('votes'))
df.show(50)

+--------+---------+--------------------+-----+
|comments|frontpage|         source_text|votes|
+--------+---------+--------------------+-----+
|       0|        1|Chess therapy - W...|    4|
|       0|        0|Will the debate a...|    1|
|       0|        0|How to remember t...|    1|
|       2|        0|GitHub - Miscella...|    3|
|       2|        1|PFAS: EPA's new r...|   39|
|       0|        0|[2404.05961] LLM2...|    1|
|       0|        0|Quando Aggiorname...|    1|
|       0|        1|Brazil's Twitter ...|    4|
|       0|        0|No Substitute for...|    1|
|       0|        0|Moore's Law for E...|    1|
|       0|        0|Why workplaces sh...|    1|
|       0|        0|SEOperate: Notion...|    2|
|       0|        0|From bug detectio...|    1|
|       5|        1|Why Can't My Mom ...|   30|
|       0|        0|Gentoo Linux beco...|    1|
|       0|        0|AI Song Cover Gen...|    1|
|       0|        0|ClassroomIO\n\nLo...|    1|
|       0|        0|Clojure's slow st...

### Removing missing values

In [19]:
# Missing values check: 2 types are treated as missing values, then count
# Type 1: Page not found
# Type 2: NULL
df = df.where(df.source_text != 'Page not found')
# dropna() returns a new dataframe, so the result has to be assigned back
df = df.dropna()
df.count()

791

### Encoding the label column

In [20]:
# Encode the label column 'frontpage' and show it to verify
from pyspark.sql.functions import when

df = df.withColumn('frontpage', when(df.frontpage==True, 1).otherwise(0))
df.show()

+--------------------+---------+
|         source_text|frontpage|
+--------------------+---------+
|Chess therapy - W...|        1|
|Will the debate a...|        0|
|How to remember t...|        0|
|GitHub - Miscella...|        0|
|PFAS: EPA's new r...|        1|
|[2404.05961] LLM2...|        0|
|Quando Aggiorname...|        0|
|Brazil's Twitter ...|        1|
|No Substitute for...|        0|
|Moore's Law for E...|        0|
|Why workplaces sh...|        0|
|SEOperate: Notion...|        0|
|From bug detectio...|        0|
|Why Can't My Mom ...|        1|
|Gentoo Linux beco...|        0|
|AI Song Cover Gen...|        0|
|ClassroomIO\n\nLo...|        0|
|Clojure's slow st...|        0|
|Frontiers | Seism...|        0|
|Run-time Polymorp...|        0|
+--------------------+---------+
only showing top 20 rows



### Removing punctuation and stopwords, and tokenizing the text

In [21]:
# For text, replace every character that is not a letter or digit (punctuation, symbols, line breaks) with a space
from pyspark.sql.functions import regexp_replace

df_punc_drop = df.withColumn('source_text', regexp_replace(df.source_text, '[^a-zA-Z0-9]', ' '))
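
To see what this regular expression does to a single string, the minimal sketch below applies the same pattern with Python's `re` module and then splits on single whitespace characters. The sample string and the helper name `clean_and_split` are illustrative assumptions, not part of the notebook. Because each removed character becomes its own space, runs of spaces yield empty tokens, which is consistent with the empty entries visible in the `tokens` column further down.

```python
import re

def clean_and_split(text, drop_empty=False):
    """Replace non-alphanumeric characters with spaces, lowercase, and split.

    Hypothetical helper for illustration; it only mimics the cleaning step
    on a plain Python string.
    """
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Splitting on every single whitespace character keeps an empty string
    # wherever two spaces are adjacent.
    tokens = re.split(r"\s", cleaned.lower())
    if drop_empty:
        tokens = [t for t in tokens if t]
    return tokens

sample = "GitHub - Miscellaneous: notes"  # made-up example text
raw_tokens = clean_and_split(sample)
filtered_tokens = clean_and_split(sample, drop_empty=True)
print(raw_tokens)
print(filtered_tokens)
```

Filtering out the empty strings before featurization is one option if they are not meant to count as terms.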

In [22]:
# For text, tokenize: Tokenizer lowercases the text and splits it on whitespace
from pyspark.ml.feature import Tokenizer

df_token = Tokenizer(inputCol="source_text", outputCol="tokens").transform(df_punc_drop)
df_token.show()

+--------------------+---------+--------------------+
|         source_text|frontpage|              tokens|
+--------------------+---------+--------------------+
|Chess therapy   W...|        1|[chess, therapy, ...|
|Will the debate a...|        0|[will, the, debat...|
|How to remember t...|        0|[how, to, remembe...|
|GitHub   Miscella...|        0|[github, , , misc...|
|PFAS  EPA s new r...|        1|[pfas, , epa, s, ...|
| 2404 05961  LLM2...|        0|[, 2404, 05961, ,...|
|Quando Aggiorname...|        0|[quando, aggiorna...|
|Brazil s Twitter ...|        1|[brazil, s, twitt...|
|No Substitute for...|        0|[no, substitute, ...|
|Moore s Law for E...|        0|[moore, s, law, f...|
|Why workplaces sh...|        0|[why, workplaces,...|
|SEOperate  Notion...|        0|[seoperate, , not...|
|From bug detectio...|        0|[from, bug, detec...|
|Why Can t My Mom ...|        1|[why, can, t, my,...|
|Gentoo Linux beco...|        0|[gentoo, linux, b...|
|AI Song Cover Gen...|      

In [23]:
# For text, remove stop words (a/an/the/then/and...)
from pyspark.ml.feature import StopWordsRemover

stopwords = StopWordsRemover()
stopwords.getStopWords()

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

### Featurization: method selection

In [24]:
stopwords = stopwords.setInputCol('tokens').setOutputCol('words')
df_clean = stopwords.transform(df_token)
df_clean.show()

+--------------------+---------+--------------------+--------------------+
|         source_text|frontpage|              tokens|               words|
+--------------------+---------+--------------------+--------------------+
|Chess therapy   W...|        1|[chess, therapy, ...|[chess, therapy, ...|
|Will the debate a...|        0|[will, the, debat...|[debate, , psi, ,...|
|How to remember t...|        0|[how, to, remembe...|[remember, differ...|
|GitHub   Miscella...|        0|[github, , , misc...|[github, , , misc...|
|PFAS  EPA s new r...|        1|[pfas, , epa, s, ...|[pfas, , epa, new...|
| 2404 05961  LLM2...|        0|[, 2404, 05961, ,...|[, 2404, 05961, ,...|
|Quando Aggiorname...|        0|[quando, aggiorna...|[quando, aggiorna...|
|Brazil s Twitter ...|        1|[brazil, s, twitt...|[brazil, twitter,...|
|No Substitute for...|        0|[no, substitute, ...|[substitute, vict...|
|Moore s Law for E...|        0|[moore, s, law, f...|[moore, law, ever...|
|Why workplaces sh...|   

#### Topic encoding

In [55]:
#from pyspark.ml.clustering import LDA
#from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, VectorAssembler

#vectorizer = CountVectorizer(inputCol="words", outputCol="features")
#vectorizer_model = vectorizer.fit(df_clean)
#bow_df = vectorizer_model.transform(df_clean)

In [65]:
# Train the LDA model

#num_topics = 5
#lda = LDA(k=num_topics, maxIter=10, featuresCol="features")
#lda_model = lda.fit(bow_df)
 
# Describe the topics
#topics = lda_model.describeTopics(5)
#print("The topics described by their top-weighted terms:")
#topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------------------------+------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                   |termWeights                                                                                                       |
+-----+------------------------------+------------------------------------------------------------------------------------------------------------------+
|0    |[1748, 2342, 4219, 3877, 4504]|[6.311496829071136E-4, 5.418863928598731E-4, 3.556469258243872E-4, 3.305170428496798E-4, 3.2986957020548563E-4]   |
|1    |[0, 1, 2, 3, 4]               |[0.3763087549906073, 0.013699798194472474, 0.008989757792763758, 0.006896331218007347, 0.006705739623271153]      |
|2    |[2310, 2579, 2825, 3140, 3316]|[7.631756268808031E-4, 6.592717391125694E-4, 5.658672718096903E-4, 5.377084648873458E-4, 5.307010643169863E-4]    |
|3    |[231, 323, 516, 707

In [66]:
#transformed_df = lda_model.transform(bow_df)
#transformed_df.select('topicDistribution').show(truncate=False)

In [13]:
# Featurization of Spark dataframe
#from pyspark.ml.feature import Word2Vec

# Learn a mapping from words to Vectors.
#word2Vec = Word2Vec(vectorSize=50, minCount=0, inputCol="words", outputCol="features")
#word_embed = word2Vec.fit(df_clean)
#result = word_embed.transform(df_clean)
#result = result.drop('source_text','tokens','words')
#result.show(truncate=False)


#### TF-IDF

In [74]:
# Bag-of-words featurization with TF-IDF
from pyspark.ml.feature import HashingTF, IDF

# Term frequencies: hash each word into a fixed-size sparse vector
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featurizedData = hashingTF.transform(df_clean)

# Inverse document frequencies: fit on the corpus, then rescale the term frequencies
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
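
As a rough illustration of what these two stages compute, the sketch below reproduces the idea in plain Python on a made-up three-document corpus. It assumes the smoothed IDF formula given in the Spark documentation, log((m + 1) / (df + 1)), and uses `zlib.crc32` purely as a stand-in hash (Spark uses a different hash function); the names `hash_term` and `tf_idf` are hypothetical. The bucket count 2 ** 18 = 262144 matches the vector size shown in the `rawFeatures` and `features` columns.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 2 ** 18  # 262144, the vector size seen in the notebook output

def hash_term(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index (stand-in hash, for illustration only)."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = term count in the document * log((m + 1) / (df + 1)),
    where m is the number of documents and df the document frequency.
    """
    m = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    idf = {t: math.log((m + 1) / (n + 1)) for t, n in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

# Made-up toy corpus
docs = [
    ["spark", "streaming", "spark"],
    ["spark", "model"],
    ["naive", "bayes", "model"],
]
weights = tf_idf(docs)
print(weights[0])
print(hash_term("spark"))
```

A term that appears in most documents gets a weight close to zero, while a rare term is weighted up, which is the motivation for rescaling raw counts before classification.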

In [75]:
rescaledData.show()

+--------------------+---------+--------------------+--------------------+--------------------+--------------------+
|         source_text|frontpage|              tokens|               words|         rawFeatures|            features|
+--------------------+---------+--------------------+--------------------+--------------------+--------------------+
|Chess therapy   W...|        1|[chess, therapy, ...|[chess, therapy, ...|(262144,[666,1097...|(262144,[666,1097...|
|Will the debate a...|        0|[will, the, debat...|[debate, , psi, ,...|(262144,[161,440,...|(262144,[161,440,...|
|How to remember t...|        0|[how, to, remembe...|[remember, differ...|(262144,[1578,212...|(262144,[1578,212...|
|GitHub   Miscella...|        0|[github, , , misc...|[github, , , misc...|(262144,[654,2139...|(262144,[654,2139...|
|PFAS  EPA s new r...|        1|[pfas, , epa, s, ...|[pfas, , epa, new...|(262144,[329,633,...|(262144,[329,633,...|
| 2404 05961  LLM2...|        0|[, 2404, 05961, ,...|[, 2404, 05

In [77]:
# Keep only the label and the TF-IDF feature vector
training = rescaledData.drop('source_text', 'tokens', 'words', 'rawFeatures')
training.show()

+---------+--------------------+
|frontpage|            features|
+---------+--------------------+
|        1|(262144,[666,1097...|
|        0|(262144,[161,440,...|
|        0|(262144,[1578,212...|
|        0|(262144,[654,2139...|
|        1|(262144,[329,633,...|
|        0|(262144,[1296,333...|
|        0|(262144,[101,873,...|
|        1|(262144,[71,1097,...|
|        0|(262144,[6,303,33...|
|        0|(262144,[303,1578...|
|        0|(262144,[619,666,...|
|        0|(262144,[161,2701...|
|        0|(262144,[921,991,...|
|        1|(262144,[991,1546...|
|        0|(262144,[1152,213...|
|        0|(262144,[2701,332...|
|        0|(262144,[1007,300...|
|        0|(262144,[161,329,...|
|        0|(262144,[154,161,...|
|        0|(262144,[1277,230...|
+---------+--------------------+
only showing top 20 rows



In [91]:
print('Count of negative cases:', training.select('frontpage').where(training.frontpage==0).count())
print('Count of positive cases:', training.select('frontpage').where(training.frontpage==1).count())
print('Negative:positive ratio:', training.select('frontpage').where(training.frontpage==0).count()/training.select('frontpage').where(training.frontpage==1).count())

Count of negative cases: 665
Count of positive cases: 126
Negative:positive ratio: 5.277777777777778


### Train/test data split

In [100]:
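# Split the data into train (60%) and test (40%) with a fixed seed
# Note: training_final is produced by the oversampling cells below (In [96]-[97]), which were run before this cell
splits = training_final.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]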
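
The execution counters suggest that the oversampling cells (In [96] to In [99]) ran before this split (In [100]), so copies of the same positive row may land in both `train` and `test`. A possible alternative, sketched below in plain Python on made-up rows, is to split first and oversample only the training part. The function name `split_then_oversample` and the toy data are assumptions for illustration and do not reproduce the notebook's Spark code.

```python
import random

def split_then_oversample(rows, label_key="frontpage", train_frac=0.6, seed=1234):
    """Split rows into train/test, then oversample the minority in train only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]

    minority = [r for r in train if r[label_key] == 1]
    majority = [r for r in train if r[label_key] == 0]
    # Draw minority rows with replacement until both classes have equal size
    n_extra = max(0, len(majority) - len(minority)) if minority else 0
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return train + extra, test

# Made-up rows: roughly one positive for every five negatives
rows = [{"id": i, "frontpage": 1 if i % 6 == 0 else 0} for i in range(60)]
train_rows, test_rows = split_then_oversample(rows)
print(len(train_rows), len(test_rows))
```

With this ordering the test set keeps the original class balance and shares no rows with the training set, so the measured accuracy is less likely to be inflated by duplicated examples.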

### Oversampling on train set

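Positive cases (front-page stories) are few, so a model may fail to learn their patterns. To address this class imbalance, the minority class is oversampled with replacement. Note that in the cells below the oversampling is applied to the full dataset, and the split above is then taken from the oversampled data.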

In [96]:
def oversample_minority(df, ratio=1, seed=88):
    '''
    Oversample the minority class (frontpage=1) with replacement.
    ratio is the target ratio of majority to minority
    Eg. ratio 1 is equivalent to majority:minority = 1:1
    ratio 5 is equivalent to majority:minority = 5:1
    '''
    majority = df.filter("frontpage=0")
    minority = df.filter("frontpage=1")
    minority_count = minority.count()
    majority_count = majority.count()

    balance_ratio = majority_count / minority_count

    print(f"Initial Majority:Minority ratio is {balance_ratio:.2f}:1")
    if ratio >= balance_ratio:
        print("No oversampling of minority was done as the input ratio was more than or equal to the initial ratio.")
        # Return the data unchanged; sampling with a fraction below 1 would shrink the minority
        return df

    print(f"Oversampling of minority done such that Majority:Minority ratio is {ratio}:1")

    # Sampling with replacement yields roughly fraction * minority_count rows
    oversampled_minority = minority.sample(withReplacement=True, fraction=(balance_ratio/ratio), seed=seed)
    oversampled_df = majority.union(oversampled_minority)

    return oversampled_df

In [97]:
training_final = oversample_minority(training, ratio=1)

Initial Majority:Minority ratio is 5.28:1
Oversampling of minority done such that Majority:Minority ratio is 1:1
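
The sampling fraction behind this message can be checked by hand. The sketch below redoes the arithmetic in plain Python with the class counts reported earlier (665 negatives, 126 positives); the helper name `oversampling_fraction` is hypothetical. Spark's `sample` does not guarantee an exact row count, so the expected value is only approximate, which is consistent with the 704 positives counted after oversampling.

```python
def oversampling_fraction(majority_count, minority_count, ratio=1.0):
    """Fraction to pass to sampling with replacement for a target ratio.

    Returns 1.0 (no change) when the data is already at or below the
    requested majority:minority ratio.
    """
    balance_ratio = majority_count / minority_count
    if ratio >= balance_ratio:
        return 1.0
    return balance_ratio / ratio

# Class counts taken from the notebook output
majority_count = 665
minority_count = 126

fraction = oversampling_fraction(majority_count, minority_count, ratio=1)
expected_minority = round(fraction * minority_count)
print(f"fraction = {fraction:.4f}, expected minority rows = {expected_minority}")
```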


In [98]:
training_final.show()

+---------+--------------------+
|frontpage|            features|
+---------+--------------------+
|        0|(262144,[161,440,...|
|        0|(262144,[1578,212...|
|        0|(262144,[654,2139...|
|        0|(262144,[1296,333...|
|        0|(262144,[101,873,...|
|        0|(262144,[6,303,33...|
|        0|(262144,[303,1578...|
|        0|(262144,[619,666,...|
|        0|(262144,[161,2701...|
|        0|(262144,[921,991,...|
|        0|(262144,[1152,213...|
|        0|(262144,[2701,332...|
|        0|(262144,[1007,300...|
|        0|(262144,[161,329,...|
|        0|(262144,[154,161,...|
|        0|(262144,[1277,230...|
|        0|(262144,[706,873,...|
|        0|(262144,[1179,230...|
|        0|(262144,[1546,356...|
|        0|(262144,[161,991,...|
+---------+--------------------+
only showing top 20 rows



In [99]:
print('Count of negative cases:', training_final.select('frontpage').where(training_final.frontpage==0).count())
print('Count of positive cases:', training_final.select('frontpage').where(training_final.frontpage==1).count())

Count of negative cases: 665
Count of positive cases: 704


## Model training and evaluation

### Model 1: Naive Bayes

In [101]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial", featuresCol='features', labelCol='frontpage')

# train the model
nbm = nb.fit(train)
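
For intuition about what a multinomial Naive Bayes model with additive smoothing does, here is a minimal plain-Python sketch on made-up sparse feature dictionaries. It follows a standard textbook formulation and is not claimed to match Spark's implementation detail for detail; the function names and the toy data are assumptions. For a document with no features the score reduces to the log prior, which is consistent with the rows with empty feature vectors in the prediction output, where the first `rawPrediction` entry (about -0.7486) is the logarithm of the first `probability` entry (about 0.473).

```python
import math
from collections import defaultdict

def train_multinomial_nb(docs, labels, vocab_size, smoothing=1.0):
    """Fit log priors and smoothed log likelihoods from {index: value} docs."""
    class_docs = defaultdict(int)
    term_totals = defaultdict(lambda: defaultdict(float))
    for feats, y in zip(docs, labels):
        class_docs[y] += 1
        for idx, value in feats.items():
            term_totals[y][idx] += value

    n_docs, n_classes = len(docs), len(class_docs)
    model = {}
    for y, n_y in class_docs.items():
        denom = sum(term_totals[y].values()) + smoothing * vocab_size
        log_prior = math.log((n_y + smoothing) / (n_docs + smoothing * n_classes))
        log_theta = {i: math.log((c + smoothing) / denom) for i, c in term_totals[y].items()}
        log_unseen = math.log(smoothing / denom)  # terms never seen in class y
        model[y] = (log_prior, log_theta, log_unseen)
    return model

def predict_nb(model, feats):
    """Return (predicted label, per-class log scores) for one document."""
    scores = {
        y: log_prior + sum(v * log_theta.get(i, log_unseen) for i, v in feats.items())
        for y, (log_prior, log_theta, log_unseen) in model.items()
    }
    return max(scores, key=scores.get), scores

# Made-up toy data: indices 0-1 occur in class 0, indices 2-3 in class 1
toy_docs = [{0: 2, 1: 1}, {0: 1}, {2: 3}, {2: 1, 3: 1}]
toy_labels = [0, 0, 1, 1]
nb_model = train_multinomial_nb(toy_docs, toy_labels, vocab_size=4)

label_a, _ = predict_nb(nb_model, {0: 1})
label_b, _ = predict_nb(nb_model, {2: 2})
_, empty_scores = predict_nb(nb_model, {})
print(label_a, label_b, empty_scores)
```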

In [102]:
# predict on the test set and display example rows
predictions = nbm.transform(test)
predictions.show()

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="frontpage", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

+---------+--------------------+--------------------+--------------------+----------+
|frontpage|            features|       rawPrediction|         probability|prediction|
+---------+--------------------+--------------------+--------------------+----------+
|        0|      (262144,[],[])|[-0.7485769854992...|[0.47303921568627...|       1.0|
|        0|      (262144,[],[])|[-0.7485769854992...|[0.47303921568627...|       1.0|
|        0|(262144,[154,161,...|[-94712.499142515...|[1.0,7.6786343743...|       0.0|
|        0|(262144,[161,329,...|[-27295.348303823...|[0.99999999891211...|       0.0|
|        0|(262144,[161,991,...|[-21942.381763917...|[3.48197865980789...|       1.0|
|        0|(262144,[161,2409...|[-8071.3385806557...|[0.99999999116115...|       0.0|
|        0|(262144,[216,1125...|[-6643.6679756373...|[0.99999993261046...|       0.0|
|        0|(262144,[329,424,...|[-8361.2432380027...|[1.06902030005976...|       1.0|
|        0|(262144,[329,467,...|[-83552.731576905...|[

In [103]:
# Save the model to a local directory; set model_path to your own location
# save() returns nothing, so its result is not assigned

model_path = 'C:/Users/Admin/Advanced Analytics for Bid Data World/Assignment 3/models/naive_bayes'
nbm.save(model_path)

## Model deployment

Deployment has two goals:
(1) load the saved model
(2) apply the same preprocessing to each incoming message.

In [40]:
import threading

# Helper thread to avoid the Spark StreamingContext from blocking Jupyter
        
class StreamingThread(threading.Thread):
    def __init__(self, ssc):
        super().__init__()
        self.ssc = ssc
    def run(self):
        self.ssc.start()
        self.ssc.awaitTermination()
    def stop(self):
        print('----- Stopping... this may take a few seconds -----')
        self.ssc.stop(stopSparkContext=False, stopGraceFully=True)
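
The helper above works because the blocking wait runs in a separate thread, which leaves the notebook free to execute further cells. The sketch below shows the same pattern without Spark, using a made-up `FakeStreamingContext` whose methods only imitate the calls used here; both class names are assumptions for illustration.

```python
import threading
import time

class FakeStreamingContext:
    """Hypothetical stand-in that blocks until stop() is called."""
    def __init__(self):
        self._stopped = threading.Event()
        self.batches = 0

    def start(self):
        pass  # nothing to initialise in this toy version

    def awaitTermination(self):
        # Block, pretending to process one batch every 10 ms
        while not self._stopped.wait(timeout=0.01):
            self.batches += 1

    def stop(self):
        self._stopped.set()

class BackgroundRunner(threading.Thread):
    """Run the blocking context in a thread so the caller is not blocked."""
    def __init__(self, ctx):
        super().__init__(daemon=True)
        self.ctx = ctx

    def run(self):
        self.ctx.start()
        self.ctx.awaitTermination()

    def stop(self):
        self.ctx.stop()

runner = BackgroundRunner(FakeStreamingContext())
runner.start()          # returns immediately
time.sleep(0.05)        # the main thread stays free to do other work
runner.stop()
runner.join(timeout=1)
print("batches processed:", runner.ctx.batches)
```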

In [41]:
from pyspark.streaming import StreamingContext
from pyspark.sql import Row
from pyspark.sql.functions import udf, struct, array, col, lit
from pyspark.sql.types import StringType, FloatType

In [42]:
from pyspark.ml.classification import NaiveBayesModel

In [104]:
globals()['models_loaded'] = False
globals()['my_model'] = None

# Define the prediction function
#def predict(df):
    #return globals()['my_model'].transform(df)

#predict_udf = udf(predict, FloatType())

# The final function
def process(time, rdd):
    if rdd.isEmpty():
        return
    
    print("========= %s =========" % str(time))
    
    # Convert to data frame
    df_stream = spark.read.json(rdd)
    df_stream = df_stream.select(col('source_text'), col('frontpage'))
    #df_stream.show()
    
    # Remove punctuations
    df_punc_drop_stream = df_stream.withColumn('source_text', regexp_replace(df_stream.source_text, '[^a-zA-Z0-9]', ' '))
    
    # Transformed with tokens
    df_token_stream = Tokenizer(inputCol="source_text", outputCol="tokens").transform(df_punc_drop_stream)
    #df_token_stream.show()
    
    # Remove stopwords
    df_clean_stream = stopwords.transform(df_token_stream)
    #df_clean_stream.show()
    
    # Apply Word2Vec
    #result_stream = word_embed.transform(df_clean_stream)

    # Apply TF-IDF
    featurizedData_stream = hashingTF.transform(df_clean_stream)
    rescaledData_stream = idfModel.transform(featurizedData_stream)
    
    # Keep only the label and the feature columns
    training_stream = rescaledData_stream.drop('source_text','tokens','words')
    #training_stream.show()

    # Load the saved model once, on the first non-empty batch
    if not globals()['models_loaded']:
        globals()['my_model'] = NaiveBayesModel.load(model_path)
        globals()['models_loaded'] = True
        
    # Predict with the loaded model and display the result
    
    df_result = globals()['my_model'].transform(training_stream)
    df_result.show()
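
The `globals()` flags in `process` implement a load-once pattern: the model is read from disk on the first batch and reused afterwards. One possible alternative, sketched below under the assumption that a cached loader function is acceptable, is `functools.lru_cache`. The loader returns a placeholder dictionary where the notebook would call `NaiveBayesModel.load(model_path)`; the names `get_model` and `load_calls` and the example path are hypothetical.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs

@lru_cache(maxsize=None)
def get_model(path):
    """Load the model for `path` once and reuse it on later calls."""
    global load_calls
    load_calls += 1
    # Placeholder for the real load step, e.g. NaiveBayesModel.load(path)
    return {"path": path}

# Simulate three streaming batches asking for the same model
first = get_model("models/naive_bayes")
second = get_model("models/naive_bayes")
third = get_model("models/naive_bayes")
print("load calls:", load_calls)
```

The cache is keyed on the path, so a different model path would trigger a separate load without any extra bookkeeping.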

## Streaming prediction

In [105]:
ssc = StreamingContext(sc, 10)

In [106]:
lines = ssc.socketTextStream("seppe.net", 7778)
lines.foreachRDD(process)

In [107]:
ssc_t = StreamingThread(ssc)
ssc_t.start()

+---------+--------------------+--------------------+--------------------+--------------------+----------+
|frontpage|         rawFeatures|            features|       rawPrediction|         probability|prediction|
+---------+--------------------+--------------------+--------------------+--------------------+----------+
|    false|(262144,[161,1546...|(262144,[161,1546...|[-4835.5888549435...|[1.81820577465556...|       1.0|
|     true|(262144,[300,654,...|(262144,[300,654,...|[-27998.635831693...|[1.0,5.2292637938...|       0.0|
+---------+--------------------+--------------------+--------------------+--------------------+----------+

+---------+--------------------+--------------------+--------------------+--------------------+----------+
|frontpage|         rawFeatures|            features|       rawPrediction|         probability|prediction|
+---------+--------------------+--------------------+--------------------+--------------------+----------+
|    false|(262144,[991,1096...|(262

In [108]:
ssc_t.stop()

----- Stopping... this may take a few seconds -----
+---------+--------------------+--------------------+--------------------+--------------------+----------+
|frontpage|         rawFeatures|            features|       rawPrediction|         probability|prediction|
+---------+--------------------+--------------------+--------------------+--------------------+----------+
|    false|(262144,[182407,1...|(262144,[182407,1...|[-149.85850097250...|[0.74272319138639...|       0.0|
|     true|(262144,[1512,177...|(262144,[1512,177...|[-5164.6777065180...|[1.0,1.2552073679...|       0.0|
|    false|(262144,[216,921,...|(262144,[216,921,...|[-12447.752652047...|[1.0,5.5545283786...|       0.0|
|    false|(262144,[1043,120...|(262144,[1043,120...|[-33190.357608938...|[1.0,3.1690385415...|       0.0|
|    false|(262144,[968,1303...|(262144,[968,1303...|[-26542.874416241...|      [1.0,4.4E-323]|       0.0|
+---------+--------------------+--------------------+--------------------+------------------