## Part 2 Datasets/DataFrames: Spark ML and Pipelines
Convert the review texts to a classic vector space representation with TFIDF-weighted features based on the Spark DataFrame/Dataset API by building a transformation pipeline. The primary goal of this part is the preparation of the pipeline for Part 3 (see below). Note: although parts of this pipeline will be very similar to Assignment 1 or Part 1 above, do not expect to obtain identical results or have access to all intermediate outputs to compare the individual steps.

Use built-in functions for tokenization to unigrams at whitespaces, tabs, digits, and the delimiter characters ()[]{}.!?,;:+=-_"'`~#@&*%€$§\/, casefolding, stopword removal, TF-IDF calculation, and chi-square selection (using the 2000 top terms overall). Write the terms selected this way to a file output_ds.txt and compare them with the terms selected in Assignment 1. Describe your observations briefly in the submission report (see Part 3).

[Provided link for ML pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html)  
[Provided link for feature extraction](https://spark.apache.org/docs/latest/ml-features.html)

## Imports

In [1]:
from datetime import datetime

from pyspark import SparkContext, SparkConf
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC, LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import (
    ChiSqSelector, ChiSqSelectorModel, CountVectorizer, HashingTF, IDF,
    Normalizer, RegexTokenizer, StopWordsRemover, StringIndexer, Tokenizer,
    UnivariateFeatureSelector, VectorSlicer,
)
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit

In [2]:
import warnings
warnings.filterwarnings(action='once')

## Initialize Spark

In [3]:
# Initialize Spark context and session
conf = SparkConf().setAppName("Part2")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)


In [4]:
spark

### Importing data
Load the stopwords file and the review development set.

In [5]:
# Define the stopwords file
stopwords_file = "stopwords.txt"

# Load stopwords into a set
with open(stopwords_file, "r") as f:
    stopwords = set(f.read().strip().split())
    
# Load and preprocess the Amazon reviews dataset
input_file = "hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json"
reviews_df = spark.read.json(input_file)

In [6]:
reviews_df = reviews_df.select("category", "reviewText")

# Show the DataFrame with selected columns
reviews_df.show()

+--------------------+--------------------+
|            category|          reviewText|
+--------------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|
|Patio_Lawn_and_Garde|This is a very ni...|
|Patio_Lawn_and_Garde|The metal base wi...|
|Patio_Lawn_and_Garde|For the most part...|
|Patio_Lawn_and_Garde|This hose is supp...|
|Patio_Lawn_and_Garde|This tool works v...|
|Patio_Lawn_and_Garde|This product is a...|
|Patio_Lawn_and_Garde|I was excited to ...|
|Patio_Lawn_and_Garde|I purchased the L...|
|Patio_Lawn_and_Garde|Never used a manu...|
|Patio_Lawn_and_Garde|Good price. Good ...|
|Patio_Lawn_and_Garde|I have owned the ...|
|Patio_Lawn_and_Garde|I had "won" a sim...|
|Patio_Lawn_and_Garde|The birds ate all...|
|Patio_Lawn_and_Garde|Bought last summe...|
|Patio_Lawn_and_Garde|I knew I had a mo...|
|Patio_Lawn_and_Garde|I was a little wo...|
|Patio_Lawn_and_Garde|I have used this ...|
|Patio_Lawn_and_Garde|I actually do not...|
|Patio_Lawn_and_Garde|Just what 

## Create RegexTokenizer

In [7]:
pattern = r'\s+|\d+|[(){}\[\].!?,;:+=_"\'`~#@&*%€$§\\/\-]'
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="words", pattern=pattern)
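
To get a feel for what this pattern splits on, the following plain-Python sketch mimics the tokenizer with the standard `re` module. It assumes RegexTokenizer's defaults (the pattern matches gaps, text is lowercased, empty tokens are dropped); the helper name `tokenize` is hypothetical, and Java and Python regex engines may differ in edge cases.

```python
import re

# Same delimiter pattern as in the cell above.
PATTERN = r'\s+|\d+|[(){}\[\].!?,;:+=_"\'`~#@&*%€$§\\/\-]'


def tokenize(text):
    """Lowercase the text, split on the delimiter pattern, drop empty tokens."""
    return [tok for tok in re.split(PATTERN, text.lower()) if tok]


# Digits act as delimiters, so "2nd-hand" becomes "nd" and "hand".
print(tokenize("Good price. Good 2nd-hand tool!"))
```

Because digits are treated as delimiters, fragments such as "nd" survive tokenization unless the stopword list or a minimum token length removes them.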

## Create StopWordsRemover

In [8]:
remover = StopWordsRemover(stopWords = list(stopwords), inputCol="words", outputCol="filtered_words")


## Create CountVectorizer

In [9]:
#hashingTF = HashingTF(inputCol="filtered_words", outputCol="rawFeatures")
countV = CountVectorizer(inputCol="filtered_words", outputCol="rawFeatures").setMinTF(1).setMinDF(3).setVocabSize(7500)


## Create IDF model

In [10]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
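
As a rough illustration of what the CountVectorizer and IDF stages compute together, here is a plain-Python sketch using the smoothed formula from the Spark documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents. The toy documents and the helper names are made up, and the sketch ignores the `minDF` and `vocabSize` limits set above.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed IDF per term: log((m + 1) / (df + 1)), m = number of documents."""
    m = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}


def tfidf(doc, idf):
    """Multiply the raw count of each term by its IDF weight."""
    return {term: tf * idf[term] for term, tf in Counter(doc).items()}


# Toy corpus of already tokenized and filtered reviews (illustrative only).
docs = [["hose", "leaks"], ["hose", "works", "works"], ["great", "tool"]]
idf = idf_weights(docs)
print(tfidf(docs[1], idf))
```

Terms that occur in many documents ("hose") receive a smaller weight than terms that are concentrated in a few ("works").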


## Create StringIndexer

In [11]:
# Apply StringIndexer to convert categorical column to numerical
indexer = StringIndexer(inputCol="category", outputCol="label")


## Create Selector
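With `featureType` and `labelType` both set to categorical, UnivariateFeatureSelector scores features with a chi-squared test. In `numTopFeatures` mode, a `selectionThreshold` of 2000 keeps the top 2000 terms overall.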

In [12]:
selector = UnivariateFeatureSelector(
    featuresCol="features",outputCol="selectedFeatures",
    labelCol="label",  selectionMode="numTopFeatures",
)
selector.setFeatureType("categorical").setLabelType("categorical").setSelectionThreshold(2000)

UnivariateFeatureSelector_f19ffa3b2272
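
For intuition, the chi-squared statistic behind this selection can be computed by hand for a single term. The sketch below is a simplification: it uses a 2x2 table of term presence versus category with made-up counts, whereas Spark treats every distinct feature value as its own category. The function name `chi_square` is hypothetical.

```python
def chi_square(table):
    """Pearson chi-squared statistic for a contingency table (list of rows).

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: term present / absent; columns: category A / category B (made-up counts).
print(chi_square([[10, 0], [0, 10]]))  # term perfectly separates the categories
print(chi_square([[5, 5], [5, 5]]))    # term is independent of the category
```

A term whose occurrence is strongly tied to one category yields a large statistic, while a term spread evenly across categories yields a value near zero.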

## Create Pipeline

In [13]:
pipeline = Pipeline(stages=[regexTokenizer, remover, countV, idf, indexer, selector])

In [14]:
model = pipeline.fit(reviews_df)

In [15]:
result = model.transform(reviews_df)

In [16]:
result.select("selectedFeatures").show(truncate=False)


## Get top 2000 terms

In [17]:
# Extract vocabulary from CountVectorizer
vocabulary = model.stages[2].vocabulary

# Map selected feature indices to terms
selected_terms = [vocabulary[i] for i in model.stages[-1].selectedFeatures]

In [18]:
len(selected_terms)

2000

In [19]:
vocabulary[:30] == selected_terms[:30]

False

In [20]:
type(selected_terms)

list

In [21]:
with open("output_ds.txt", "w") as f:
    for term in selected_terms:
        f.write(term + "\n")
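
One simple way to compare the selected terms with those from Assignment 1 is to look at the overlap of the two term sets. The sketch below is only an illustration: the helper `compare_terms` and the example term lists are made up, and loading the Assignment 1 terms is left out because their file format is not shown here.

```python
def compare_terms(terms_a, terms_b):
    """Return the shared terms and the Jaccard similarity of two term lists."""
    set_a, set_b = set(terms_a), set(terms_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Made-up term lists for illustration only.
shared, jaccard = compare_terms(
    ["hose", "garden", "mower"],
    ["garden", "mower", "album"],
)
print(sorted(shared), jaccard)
```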

# Part 3

In this part, you will train a text classifier from the features extracted in Part 2. The goal is to learn a model that can predict the product category from a review's text.

To this end, extend the pipeline from Part 2 such that a Support Vector Machine classifier is trained. Since we are dealing with multi-class problems, make sure to put a strategy in place that allows binary classifiers to be applicable. Apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).

Follow best practices for machine learning experiment design and investigate the effects of parameter settings using the functions provided by Spark:

- Split the review data into training, validation, and test set.

- Make experiments reproducible.

- Use a grid search for parameter optimization:

  - Compare chi square overall top 2000 filtered features with another, heavier filtering with much less dimensionality (see Spark ML documentation for options).

  - Compare different SVM settings by varying the regularization parameter (choose 3 different values), standardization of training features (2 values), and maximum number of iterations (2 values).

Use the MulticlassClassificationEvaluator to estimate performance of your trained classifiers on the test set, using F1 measure as criterion.

## Split into Train/Test/Val and set seed

In [22]:
(reviews_train, reviews_test) = reviews_df.randomSplit([0.7, 0.3], 123)

In [23]:
reviews_train.count()

55211

In [24]:
reviews_test.count()

23618
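
The validation set is not split off here; it is carved out of the training data later by TrainValidationSplit. As an illustration of how a seeded split stays reproducible, here is a plain-Python sketch of a three-way split. The fractions 0.49/0.21/0.30 correspond to a 70/30 train/test split followed by a 70/30 train/validation split; the helper name is hypothetical, and unlike Spark's `randomSplit`, which samples each row independently, this sketch produces exact sizes.

```python
import random


def three_way_split(items, fractions=(0.49, 0.21, 0.30), seed=123):
    """Shuffle with a fixed seed and cut into train, validation and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Calling the function again with the same seed returns the same three parts, which is what makes the experiment repeatable.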

## Extend Pipeline

In [25]:
normalizer = Normalizer(inputCol="selectedFeatures", outputCol="normalized_features")
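
Normalizer defaults to the L2 norm (p = 2), so each feature vector is scaled to unit Euclidean length. A minimal plain-Python sketch of that operation, with a hypothetical helper name and a made-up vector:

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


print(l2_normalize([3.0, 4.0]))
```

After normalization, long and short reviews contribute vectors of the same length, so the classifier sees term proportions rather than absolute counts.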

In [26]:
#scaler = StandardScaler(inputCol="normalized_features", outputCol="scaled_features", withMean=False)

In [27]:
svm = LinearSVC(featuresCol="normalized_features")
ovr = OneVsRest(classifier=svm)
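
LinearSVC is a binary classifier, so OneVsRest trains one "this category versus all others" model per category and predicts the category whose model is most confident. The idea can be sketched in plain Python as follows; the helper names, labels and scores are made up for illustration.

```python
def binary_targets(labels, positive):
    """Relabel a multi-class problem as 'positive class vs. rest' (1 or 0)."""
    return [1 if label == positive else 0 for label in labels]


def one_vs_rest_predict(scores_by_class):
    """Pick the class whose binary classifier reports the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


labels = ["Book", "Music", "Book", "Toys"]
print(binary_targets(labels, "Book"))

# Made-up decision scores of the three binary classifiers for one review.
print(one_vs_rest_predict({"Book": -0.4, "Music": 1.3, "Toys": 0.2}))
```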

In [28]:
extended_pipeline = Pipeline(stages=[regexTokenizer, remover, countV, idf, indexer, selector, normalizer, ovr])

## Define Evaluator

In [29]:
evaluator = MulticlassClassificationEvaluator(metricName="f1")
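
The `f1` metric of MulticlassClassificationEvaluator is a weighted F1: the per-class F1 scores averaged with each class's share of the true labels as weight. A plain-Python sketch of that calculation, with a hypothetical function name and made-up labels:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class's share of the true labels as weight."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score


# Made-up labels: one review of class 0 is misclassified as class 1.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class frequency, large categories dominate the score; rare categories with poor predictions lower it only slightly.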

## Perform Grid Search using different SVM settings

Create a parameter grid over the SVM and selector settings. Use TrainValidationSplit, which trains and evaluates one model per parameter combination on a single train/validation split.

In [30]:
param_grid = ParamGridBuilder() \
    .addGrid(svm.regParam, [0.1, 0.01, 0.001]) \
    .addGrid(svm.maxIter, [10, 20]) \
    .addGrid(svm.standardization, [True, False]) \
    .addGrid(selector.selectionThreshold, [2000, 250]) \
    .build()

validator = TrainValidationSplit(estimator=extended_pipeline, estimatorParamMaps=param_grid, evaluator=evaluator, trainRatio = 0.7, parallelism = 4, seed=123)
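
To see how much work this grid implies, the combinations can be enumerated in plain Python. The sketch mirrors the values passed to ParamGridBuilder above; the dictionary keys are just labels for illustration.

```python
from itertools import product

# Same values as in the ParamGridBuilder call above.
grid = {
    "regParam": [0.1, 0.01, 0.001],
    "maxIter": [10, 20],
    "standardization": [True, False],
    "selectionThreshold": [2000, 250],
}

# One dictionary per parameter combination (Cartesian product of all value lists).
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))
print(combinations[0])
```

With 3 x 2 x 2 x 2 = 24 combinations and one pipeline fit each, running the search on a subset of the training data keeps the runtime manageable.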

Perform a grid search on a subset of the training data to approximate the best parameters.

In [None]:
subset_train = reviews_train.randomSplit([0.1, 0.9], 123)

start_time = datetime.now()
cv_model = validator.fit(subset_train[0])
end_time = datetime.now()

#predictions = cv_model.transform(reviews_test)

#f1_score = evaluator.evaluate(predictions)
#print(f"F1 Score: {f1_score}")


In [None]:
print("start time:", start_time)
print("end time:", end_time)

In [None]:
# Get the best model
bestModel = cv_model.bestModel

# Show the best parameters
print("Best Params:")
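print("RegParam:", bestModel.stages[-1].models[0].getOrDefault("regParam"))
print("MaxIter:", bestModel.stages[-1].models[0].getOrDefault("maxIter"))
print("Standardization:", bestModel.stages[-1].models[0].getOrDefault("standardization"))
# stages[-2] is the Normalizer: "p" is the order of the norm, not a p-value
print("Normalizer p (norm order):", bestModel.stages[-2].getOrDefault("p"))
# stages[-3] is the feature selector
print("SelectionThreshold:", bestModel.stages[-3].getOrDefault("selectionThreshold"))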

## Compare Chi Square with another filtering method