# Part 2 Datasets/DataFrames: Spark ML and Pipelines
Convert the review texts to a classic vector space representation with TFIDF-weighted features based on the Spark DataFrame/Dataset API by building a transformation pipeline. The primary goal of this part is the preparation of the pipeline for Part 3 (see below). Note: although parts of this pipeline will be very similar to Assignment 1 or Part 1 above, do not expect to obtain identical results or have access to all intermediate outputs to compare the individual steps.

Use built-in functions for tokenization to unigrams at whitespaces, tabs, digits, and the delimiter characters ()[]{}.!?,;:+=-_"'`~#@&*%€$§\/, casefolding, stopword removal, TF-IDF calculation, and chi square selection (using 2000 top terms overall). Write the terms selected this way to a file output_ds.txt and compare them with the terms selected in Assignment 1. Describe your observations briefly in the submission report (see Part 3).

[Provided link for ML pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html)  
[Provided link for feature extraction](https://spark.apache.org/docs/latest/ml-features.html)

## Imports

In [39]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import IDF

from pyspark.ml.feature import RegexTokenizer

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import UnivariateFeatureSelector

from pyspark.ml.classification import LinearSVC, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

from pyspark.ml.feature import Normalizer

from datetime import datetime

In [40]:
import warnings
warnings.filterwarnings(action='once')

## Initialize Spark

In [41]:
# Initialize Spark context and session
conf = SparkConf().setAppName("Part2")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

24/05/25 12:16:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/05/25 12:16:12 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/05/25 12:16:12 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [42]:
spark

### Importing data
Load the stopword list and the review development set (`reviews_devset.json`).

In [43]:
# Path to the stopword list
stopwords_file = "stopwords.txt"

# Load stopwords into a set
with open(stopwords_file, "r") as f:
    stopwords = set(f.read().strip().split())
    
# Load and preprocess the Amazon reviews dataset
input_file = "hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json"
reviews_df = spark.read.json(input_file)

                                                                                

In [44]:
reviews_df = reviews_df.select("category", "reviewText")

# Show the DataFrame with selected columns
reviews_df.show()

+--------------------+--------------------+
|            category|          reviewText|
+--------------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|
|Patio_Lawn_and_Garde|This is a very ni...|
|Patio_Lawn_and_Garde|The metal base wi...|
|Patio_Lawn_and_Garde|For the most part...|
|Patio_Lawn_and_Garde|This hose is supp...|
|Patio_Lawn_and_Garde|This tool works v...|
|Patio_Lawn_and_Garde|This product is a...|
|Patio_Lawn_and_Garde|I was excited to ...|
|Patio_Lawn_and_Garde|I purchased the L...|
|Patio_Lawn_and_Garde|Never used a manu...|
|Patio_Lawn_and_Garde|Good price. Good ...|
|Patio_Lawn_and_Garde|I have owned the ...|
|Patio_Lawn_and_Garde|I had "won" a sim...|
|Patio_Lawn_and_Garde|The birds ate all...|
|Patio_Lawn_and_Garde|Bought last summe...|
|Patio_Lawn_and_Garde|I knew I had a mo...|
|Patio_Lawn_and_Garde|I was a little wo...|
|Patio_Lawn_and_Garde|I have used this ...|
|Patio_Lawn_and_Garde|I actually do not...|
|Patio_Lawn_and_Garde|Just what 

## Create RegexTokenizer

In [45]:
pattern = r'\s+|\d+|[(){}\[\].!?,;:+=_"\'`~#@&*%€$§\\/\-]'
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="words", pattern=pattern)
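
To see what the split pattern does to a single review, the following is a minimal pure-Python sketch. It uses Python's `re` module as a stand-in for the Java regex engine behind `RegexTokenizer`, so edge cases may differ; the helper name `tokenize` and the sample sentence are illustrative only. It mirrors the tokenizer defaults (split at pattern matches, lowercase, drop empty tokens).

```python
import re

# Same split pattern as in the notebook cell; Python's `re` is only an
# approximation of the Java regex engine that Spark uses.
PATTERN = re.compile(r'\s+|\d+|[(){}\[\].!?,;:+=_"\'`~#@&*%€$§\\/\-]')

def tokenize(text):
    """Lowercase the text, split at pattern matches, drop empty tokens."""
    return [tok for tok in PATTERN.split(text.lower()) if tok]

print(tokenize("Good price. Good hose-reel, 2 stars!"))
# ['good', 'price', 'good', 'hose', 'reel', 'stars']
```

Digits and delimiters act as gaps between tokens, so they never appear in the output themselves.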

## Create StopWordsRemover

In [46]:
remover = StopWordsRemover(stopWords = list(stopwords), inputCol="words", outputCol="filtered_words")

## Create CountVectorizer

In [47]:
# Term counts per review; the vocabulary keeps at most 7500 terms that occur in at least 3 reviews
countV = CountVectorizer(inputCol="filtered_words", outputCol="rawFeatures").setMinTF(1).setMinDF(3).setVocabSize(7500)

## Create IDF model

In [48]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
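
As a rough illustration of the weighting applied by the two stages above, here is a small pure-Python sketch of TF-IDF with the smoothed formula `log((m + 1) / (df + 1))` given in the Spark ML documentation, where `m` is the number of documents and `df` the document frequency of a term. The toy corpus and the helper names `idf_weights` and `tfidf` are assumptions made for the example.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed IDF per term: log((m + 1) / (df + 1)) for m documents."""
    m = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tfidf(doc, idf):
    """Raw term counts scaled by the IDF weight of each term."""
    return {term: tf * idf[term] for term, tf in Counter(doc).items()}

# Toy corpus of already tokenized reviews (illustrative only).
docs = [["hose", "leaks"], ["hose", "works", "works"], ["great", "hose"]]
idf = idf_weights(docs)

print(idf["hose"])  # 0.0, because the term occurs in every document
print(tfidf(docs[1], idf))
```

A term that appears in every document gets weight zero, which is why very common words contribute nothing after this stage.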

## Create StringIndexer

In [49]:
# Apply StringIndexer to convert categorical column to numerical
indexer = StringIndexer(inputCol="category", outputCol="label")

## Create Selector
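With `featureType` and `labelType` both set to `categorical`, `UnivariateFeatureSelector` scores features with the chi-squared test. In `numTopFeatures` mode, a `selectionThreshold` of 2000 keeps the top 2000 terms overall.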

In [50]:
selector = UnivariateFeatureSelector(
    featuresCol="features",outputCol="selectedFeatures",
    labelCol="label",  selectionMode="numTopFeatures",
)
selector.setFeatureType("categorical").setLabelType("categorical").setSelectionThreshold(2000)

UnivariateFeatureSelector_4f0c0881568c
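
The statistic behind this selection can be sketched in plain Python. The example below computes the Pearson chi-squared statistic for a small contingency table of term presence against category. It is a simplification: Spark treats every distinct feature value as its own category and ranks features by the test's p-value, and the counts here are made up for illustration.

```python
def chi_square(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: term present / term absent; columns: two hypothetical categories.
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

print(chi_square(dependent))    # 20.0
print(chi_square(independent))  # 0.0
```

A term whose occurrence is unrelated to the category scores near zero, while a term concentrated in one category scores high and is more likely to be kept.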

## Create Pipeline

In [51]:
pipeline = Pipeline(stages=[regexTokenizer, remover, countV, idf, indexer, selector])

In [52]:
model = pipeline.fit(reviews_df)

                                                                                

In [53]:
result = model.transform(reviews_df)

In [54]:
result.select("selectedFeatures").show(truncate=True)

+--------------------+
|    selectedFeatures|
+--------------------+
|(2000,[2,3,7,8,35...|
|(2000,[0,1,3,21,3...|
|(2000,[4,10,174,3...|
|(2000,[1,3,4,9,10...|
|(2000,[12,29,101,...|
|(2000,[0,3,4,8,11...|
|(2000,[18,112,175...|
|(2000,[6,21,32,36...|
|(2000,[3,4,5,6,40...|
|(2000,[6,8,38,78,...|
|(2000,[1,13,226],...|
|(2000,[5,17,33,40...|
|(2000,[1,11,28,35...|
|(2000,[40,144,339...|
|(2000,[0,3,7,9,11...|
|(2000,[8,26,57,80...|
|(2000,[1,15,120,1...|
|(2000,[2,3,221,26...|
|(2000,[4,10,16,20...|
|(2000,[0,18,30,42...|
+--------------------+
only showing top 20 rows



## Get top 2000 terms

In [55]:
# Extract vocabulary from CountVectorizer
vocabulary = model.stages[2].vocabulary

# Map selected feature indices to terms
selected_terms = [vocabulary[i] for i in model.stages[-1].selectedFeatures]

In [56]:
len(selected_terms)

2000

In [57]:
vocabulary[:30] == selected_terms[:30]

False

In [58]:
type(selected_terms)

list

In [59]:
with open("output_ds.txt", "w") as f:
    for term in selected_terms:
        f.write(term + "\n")
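
For the comparison with the Assignment 1 terms, one possible approach is a set comparison. The sketch below is only an assumption about how such a comparison could look: the helper `compare_terms` is hypothetical, and the two short lists stand in for the contents of `output_ds.txt` and the Assignment 1 term list.

```python
def compare_terms(terms_a, terms_b):
    """Return shared terms, terms unique to each list, and the Jaccard overlap."""
    a, b = set(terms_a), set(terms_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Toy lists standing in for the two term files (illustrative only).
ds_terms = ["hose", "garden", "battery", "novel"]
a1_terms = ["hose", "garden", "guitar"]

report = compare_terms(ds_terms, a1_terms)
print(report["shared"])   # ['garden', 'hose']
print(report["jaccard"])  # 0.4
```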

# Part 3

In this part, you will train a text classifier from the features extracted in Part 2. The goal is to learn a model that can predict the product category from a review's text.

To this end, extend the pipeline from Part 2 such that a Support Vector Machine classifier is trained. Since we are dealing with multi-class problems, make sure to put a strategy in place that allows binary classifiers to be applicable. Apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).

Follow best practices for machine learning experiment design and investigate the effects of parameter settings using the functions provided by Spark:

- Split the review data into training, validation, and test set.

- Make experiments reproducible.

- Use a grid search for parameter optimization:

  - Compare chi square overall top 2000 filtered features with another, heavier filtering with much less dimensionality (see Spark ML documentation for options).

  - Compare different SVM settings by varying the regularization parameter (choose 3 different values), standardization of training features (2 values), and maximum number of iterations (2 values).

Use the MulticlassClassificationEvaluator to estimate performance of your trained classifiers on the test set, using F1 measure as criterion.

## Split into Train/Test/Val and set seed

In [60]:
(reviews_train, reviews_test) = reviews_df.randomSplit([0.9, 0.1], 123)

In [61]:
reviews_train.count()

                                                                                

71031

In [62]:
reviews_test.count()

7798
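
The split above only separates a test set. The validation set comes from `TrainValidationSplit`, which divides the training portion again during the grid search. The following arithmetic sketch shows the resulting shares, assuming the inner ratio of 0.7 that this notebook passes as `trainRatio`; since `randomSplit` is randomized, actual row counts only approximate these shares.

```python
test_ratio = 0.1     # outer hold-out, as in randomSplit([0.9, 0.1], 123)
inner_train = 0.7    # assumed inner ratio for the train/validation split

fit_share = (1 - test_ratio) * inner_train
val_share = (1 - test_ratio) * (1 - inner_train)

print(f"fit: {fit_share:.0%}, validation: {val_share:.0%}, test: {test_ratio:.0%}")
# fit: 63%, validation: 27%, test: 10%
```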

## Extend Pipeline

In [63]:
normalizer = Normalizer(inputCol="selectedFeatures", outputCol="normalized_features")
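
`Normalizer` uses the L2 norm by default (`p=2.0`), so every feature vector is scaled to unit Euclidean length. A minimal pure-Python sketch of that operation, with a hypothetical helper name and a toy vector:

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors stay unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

Normalizing removes the influence of review length, so long and short reviews become comparable for the classifier.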

In [64]:
svm = LinearSVC(featuresCol="normalized_features")
ovr = OneVsRest(classifier=svm)
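
`LinearSVC` is a binary classifier, so `OneVsRest` trains one classifier per category and predicts the category whose classifier is most confident. The sketch below illustrates the idea in plain Python; the function names, labels, and scores are made up for the example and are not part of the Spark API.

```python
def binary_targets(labels, positive):
    """Relabel a multi-class target as 1 (positive class) versus 0 (rest)."""
    return [1 if label == positive else 0 for label in labels]

def one_vs_rest_predict(scores_by_class):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

labels = ["Book", "Toy", "Book", "Music"]
print(binary_targets(labels, "Book"))  # [1, 0, 1, 0]

# Hypothetical decision scores of three binary classifiers for one review.
print(one_vs_rest_predict({"Book": -0.2, "Toy": 1.3, "Music": 0.4}))  # Toy
```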

In [65]:
extended_pipeline = Pipeline(stages=[regexTokenizer, remover, countV, idf, indexer, selector, normalizer, ovr])

## Define Evaluator

In [66]:
evaluator = MulticlassClassificationEvaluator(metricName="f1")
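
The `f1` metric of `MulticlassClassificationEvaluator` is the weighted F-measure: the per-class F1 scores averaged with each class weighted by its share of the true labels. A small pure-Python sketch of that computation, using made-up labels and a hypothetical helper name:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total
    return score

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7333
```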

## Perform Grid Search using different SVM settings

Build a parameter grid over the SVM and selector settings. `TrainValidationSplit` trains one model per parameter combination on a single train/validation split, which is cheaper than k-fold cross-validation.

In [67]:
param_grid = ParamGridBuilder() \
    .addGrid(svm.regParam, [0.1, 0.01, 0.001]) \
    .addGrid(svm.maxIter, [10, 20]) \
    .addGrid(svm.standardization, [True, False]) \
    .addGrid(selector.selectionThreshold, [2000, 250]) \
    .build()

validator = TrainValidationSplit(estimator=extended_pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, trainRatio = 0.7, parallelism = 4, seed=123)
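
The size of this grid is the product of the option counts. The sketch below enumerates the combinations in plain Python with `itertools.product`; the dictionary mirrors the values passed to `ParamGridBuilder`, and its name is illustrative only.

```python
from itertools import product

# Same values as in the ParamGridBuilder call.
grid = {
    "regParam": [0.1, 0.01, 0.001],
    "maxIter": [10, 20],
    "standardization": [True, False],
    "selectionThreshold": [2000, 250],
}

combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 24
print(combos[0])
```

Each of the 24 combinations requires a full pipeline fit, and each fit in turn trains one binary SVM per category through `OneVsRest`, which helps explain the long runtime.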

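Run the grid search on the training split. `TrainValidationSplit` fits each parameter combination on 70% of it and scores it on the remaining 30%; the test split stays untouched until the final evaluation.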

In [68]:
start_time = datetime.now()
cv_model = validator.fit(reviews_train)
end_time = datetime.now()

24/05/25 12:29:27 WARN DAGScheduler: Broadcasting large task binary with size 1822.2 KiB

In [69]:
print("start time:", start_time)
print("end time:", end_time)
print("time elapsed:", end_time - start_time)

print("\nno. reviews used to train:", reviews_train.count())
print("no. reviews used to test:", reviews_test.count())

start time: 2024-05-25 12:17:08.399678
end time: 2024-05-25 13:39:32.794385
time elapsed: 1:22:24.394707

no. reviews used to train: 71031
no. reviews used to test: 7798


In [70]:
# Get the best model
bestModel = cv_model.bestModel

# Show the best parameters
# stages[-1] is the fitted OneVsRest model, stages[-3] the fitted feature selector
print("Best Params:")
print("RegParam:", bestModel.stages[-1].models[0].getOrDefault("regParam"))
print("MaxIter:", bestModel.stages[-1].models[0].getOrDefault("maxIter"))
print("Standardization:", bestModel.stages[-1].models[0].getOrDefault("standardization"))
print("Selection threshold (top features):", bestModel.stages[-3].getOrDefault("selectionThreshold"))

Best Params:
RegParam: 0.01
MaxIter: 10
Standardization: False
Selection threshold (top features): 2000.0


In [71]:
predictions = cv_model.transform(reviews_test)

f1_score = evaluator.evaluate(predictions)
print(f"F1 Score: {f1_score}")

24/05/25 13:39:38 WARN DAGScheduler: Broadcasting large task binary with size 1819.0 KiB

F1 Score: 0.630127489411732


                                                                                

In [72]:
spark.stop()