# Exercise 2 - Text Processing and Classification using Spark

## Part 2 

Convert the review texts to a classic vector space representation with TFIDF-weighted features based on the Spark DataFrame/Dataset API by building a transformation pipeline. The primary goal of this part is the preparation of the pipeline for Part 3 (see below). Note: although parts of this pipeline will be very similar to Assignment 1 or Part 1 above, do not expect to obtain identical results or have access to all intermediate outputs to compare the individual steps.

Use built-in functions for tokenization to unigrams at **whitespaces, tabs, digits, and the delimiter characters, casefolding, stopword removal, TF-IDF calculation, and chi square selection** (using 2000 top terms overall). Write the terms selected this way to a file **output_ds.txt** and compare them with the terms selected in Assignment 1. Describe your observations briefly in the submission report (see Part 3).

In [3]:
#importing libraries

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, IDF, ChiSqSelector,  StringIndexer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.ml.stat import ChiSquareTest
from pyspark import SparkContext
from pyspark import SparkConf
import json 
from operator import add
import re
from heapq import nlargest

from pyspark.ml.classification import LinearSVC,  OneVsRest
from pyspark.ml.regression import LinearRegression

In [None]:
#starting spark session

spark = SparkSession.builder.getOrCreate()


In [None]:
# We use reviews_devset.json from the cluster and keep only the columns we need.
# Load the data from HDFS and register it as a temporary view.
# createOrReplaceTempView() returns None, so the DataFrame is assigned before the view is registered.
# Then select category and reviewText from this view.

textDF = spark.read.json("hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json")
textDF.createOrReplaceTempView("review")
df = spark.sql("SELECT category, reviewText FROM review")

In [None]:
df.show()

In [None]:
# We have uploaded stopwords.txt to our cluster and use it here.

stopwordsPath = "Exercise2/stopwords.txt"
# Read the stopwords file into an RDD; collect() brings all its elements back to the driver as a Python list
stopwords = spark.sparkContext.textFile(stopwordsPath).collect()

In [None]:
# Assembling the pipeline stages.
# RegexTokenizer splits the text at the delimiter pattern and lowercases the tokens
# StopWordsRemover removes the stopwords
# CountVectorizer builds the vocabulary and counts the terms of each review (term frequencies)
# IDF rescales the term frequencies by inverse document frequency, which gives the TF-IDF weights
# StringIndexer converts the strings in the "category" column into numerical labels in the "label" column
# ChiSqSelector selects the 2000 features of the "tfidf" column that are most relevant to the "label" column and stores them in the "selected" column

tokenizer = RegexTokenizer(inputCol="reviewText", outputCol="words", pattern="\\s+|\\d+|[()\\[\\]{}.,;!?:+=\\-_\"'`~#@&*%€$§\\/]+", toLowercase=True)
remover = StopWordsRemover(inputCol="words", stopWords=stopwords,outputCol="filtered", caseSensitive=False)
vectorizer = CountVectorizer(inputCol="filtered", outputCol="vectorized")
idf = IDF(inputCol="vectorized", outputCol="tfidf")
encoder = StringIndexer(inputCol="category", outputCol="label")
chi2000 = ChiSqSelector(featuresCol="tfidf", labelCol="label", outputCol="selected", numTopFeatures=2000)
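
To see what the delimiter pattern does to a review, the tokenization and stopword steps can be approximated in plain Python. This is only a sketch: it assumes that Python's `re` module and the regex engine behind `RegexTokenizer` treat this pattern the same way, and the helper name `tokenize` is hypothetical.

```python
import re

# Same delimiter pattern as in the RegexTokenizer above, written as a raw string.
SPLIT_PATTERN = re.compile(r"""\s+|\d+|[()\[\]{}.,;!?:+=\-_"'`~#@&*%€$§\/]+""")


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text, split it at the delimiters, drop empty tokens and stopwords."""
    tokens = SPLIT_PATTERN.split(text.lower())
    return [t for t in tokens if t and t not in stopwords]


print(tokenize("Great phone, works 100% fine!"))
print(tokenize("The case is OK", stopwords={"the", "is"}))
```

Splitting produces empty strings wherever two delimiters follow each other, so the sketch filters them out explicitly.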

In [None]:
# we create a pipeline that sequentially applies the series of transformations and feature selection steps that we created 

pipeline = Pipeline().setStages([tokenizer, remover, vectorizer, idf, encoder, chi2000])
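
The `CountVectorizer` and `IDF` stages together produce the TF-IDF weights. A minimal plain-Python sketch of that weighting follows, assuming raw term counts as TF and the smoothed form log((m + 1) / (df + 1)) as IDF, which is how we read the Spark ML documentation. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the raw count of the term in the document.
    IDF is log((m + 1) / (df + 1)), with m documents and df documents containing the term.
    """
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


docs = [["good", "phone"], ["bad", "phone"], ["good", "good", "case"]]
for weights in tf_idf(docs):
    print(weights)
```

Because of the smoothing, a term that occurs in every document gets weight 0 instead of causing a division by zero.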

In [None]:
# fits the pipeline to the df
pipelineModel = pipeline.fit(df)
# transforms the original df using the learned transformations
transformedData = pipelineModel.transform(df)

In [None]:
# Select the selected features from the ChiSqSelector
selectedFeatures = pipelineModel.stages[5].selectedFeatures
# Select the list of words that the vectorizer has indexed and used to create the term frequency vectors
words = pipelineModel.stages[2].vocabulary

# find the respective words and add them to output
output = set()
for i in selectedFeatures:
    output.add(words[i])

#Sorted list of words corresponding to the selected features
sorted_output = sorted(list(output))
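
`ChiSqSelector` ranks features with a chi-square independence test between each feature and the label. The sketch below computes the Pearson chi-square statistic for one contingency table in plain Python. It is an illustration under assumptions: the table is a hypothetical term-present/term-absent count per category, whereas Spark builds its table from the feature values themselves, so the numbers need not match Spark's ranking.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (count - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = term present / absent, columns = two categories.
print(chi_square([[10, 20], [30, 40]]))
print(chi_square([[10, 20], [20, 40]]))  # proportional rows, so no dependence
```

A higher statistic means the term's occurrence depends more strongly on the category, which is why high-scoring terms are kept.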

In [None]:
sorted_output

In [None]:
# Save the output: the selected terms in alphabetical order, separated by spaces
with open('output_ds.txt', 'w') as f:
    f.write(" ".join(sorted_output))
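
For the comparison with Assignment 1, a simple set comparison may be enough. The sketch below assumes both term lists are available as Python lists; the helper `compare_terms` and the example terms are hypothetical, and how the Assignment 1 terms are loaded depends on that file's format.

```python
def compare_terms(terms_a, terms_b):
    """Compare two term lists and report the overlap and the differences."""
    a, b = set(terms_a), set(terms_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard similarity: share of terms that both selections agree on.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical example terms; in the notebook, the first list could be sorted_output.
result = compare_terms(["battery", "book", "phone"], ["book", "phone", "song"])
print(result["jaccard"], result["only_a"], result["only_b"])
```

The terms written above can be read back with `open('output_ds.txt').read().split()`.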

## Part 3

In this part, you will train a text classifier from the features extracted in Part 2. The goal is to learn a model that can predict the product category from a review's text.

To this end, extend the pipeline from Part 2 such that a Support Vector Machine classifier is trained. Since we are dealing with multi-class problems, make sure to put a strategy in place that allows binary classifiers to be applicable. Apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).

Follow best practices for machine learning experiment design and investigate the effects of parameter settings using the functions provided by Spark:

- Split the review data into training, validation, and test set.

- Make experiments reproducible.

- Use a grid search for parameter optimization:

    - Compare chi square overall top 2000 filtered features with another, heavier filtering with much less dimensionality (see Spark ML documentation for options).

    - Compare different SVM settings by varying the regularization parameter (choose 3 different values), standardization of training features (2 values), and maximum number of iterations (2 values).

- Use the MulticlassClassificationEvaluator to estimate performance of your trained classifiers on the test set, using F1 measure as criterion.


In [None]:
df=transformedData

In [None]:
# Downsampling, because the DataFrame is large and we get too many warnings about it.
# Training also takes very long on the full data.
# To use the whole DataFrame, simply delete this cell.

df = df.sample(fraction=0.01, seed=4242)

In [None]:
# Select the label and selected columns from df (select() already keeps the column names)
df2 = df.select("label", "selected")

In [None]:
# As asked in the task, we normalize the "selected" column to unit length.
# setP(2.0) sets the norm parameter p to 2, which corresponds to the L2 (Euclidean) norm.

normalizer = Normalizer().setInputCol("selected").setOutputCol("normalized").setP(2.0)
df_norm =normalizer.transform(df2)
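
What the L2 normalization does to a single feature vector can be sketched in plain Python: every component is divided by the Euclidean length of the vector. The function name `l2_normalize` is hypothetical, and leaving an all-zero vector unchanged is an assumption about the intended behavior.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


print(l2_normalize([3.0, 4.0]))
print(l2_normalize([0.0, 0.0]))
```

After this step, long and short reviews have vectors of the same length, so the classifier sees the direction of the TF-IDF vector rather than its magnitude.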

In [None]:
# Dropping the columns we no longer need

df3 = df_norm.select("label", "normalized")

In [None]:
df3.show()

In [None]:
# Splitting the data randomly; the fixed seed makes the split reproducible.
# The weights are target proportions, so the actual sizes are only approximate:
# train: about 70% of the data
# val: about 15% of the data
# test: about 15% of the data

train,val, test = df3.randomSplit([0.7,0.15, 0.15], seed = 4242)
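
The idea of a seeded random split can be illustrated in plain Python. This is a sketch, not Spark's algorithm: each row draws one random number from a seeded generator and goes into the part whose cumulative weight range contains it. The function name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one part at random, with probabilities proportional to weights."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    rng = random.Random(seed)  # fixed seed, so the same input gives the same split
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for idx, bound in enumerate(bounds):
            if u < bound:
                parts[idx].append(row)
                break
    return parts


train_rows, val_rows, test_rows = random_split(list(range(1000)), [0.7, 0.15, 0.15], seed=4242)
print(len(train_rows), len(val_rows), len(test_rows))
```

Since every row is assigned independently, the part sizes only approximate the requested proportions, and rerunning with the same seed and the same input reproduces the split.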

In [None]:
# LinearSVC is a binary classifier, so it cannot handle our multi-class labels on its own.
# OneVsRest wraps it and trains one binary classifier per category (one-vs-rest strategy).
# This baseline model is fitted on the training data with default settings apart from maxIter=10.
lsvc = LinearSVC(featuresCol="normalized", labelCol="label", maxIter=10)
ovr = OneVsRest(classifier=lsvc, featuresCol="normalized", labelCol="label")
ovr_model = ovr.fit(train)
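
The one-vs-rest strategy can be sketched in plain Python in two steps: for each class the labels are turned into a binary problem, and at prediction time the class whose binary classifier gives the highest score wins. The helper names and the example scores are hypothetical.

```python
def binarize_labels(labels, positive_class):
    """Binary training labels for one class: 1 for that class, 0 for all others."""
    return [1 if label == positive_class else 0 for label in labels]


def predict_one_vs_rest(scores_per_class):
    """Return the index of the class whose binary classifier gives the highest score."""
    return max(range(len(scores_per_class)), key=lambda k: scores_per_class[k])


labels = [0, 2, 1, 2]
print(binarize_labels(labels, positive_class=2))
# Hypothetical decision scores of three binary classifiers for one review.
print(predict_one_vs_rest([-1.2, 0.3, -0.5]))
```

The cost of this strategy is one binary model per category, which is one reason why training on the full data takes long.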

In [None]:
# Setting up the parameter grid for the grid search. We keep the iteration counts low, as training is already slow.
# The "classifier__" prefix marks parameters of the LinearSVC nested inside OneVsRest;
# it is only a naming convention and is stripped again when the grid is built.
# classifier__regParam: controls the regularization strength of the classifier
# classifier__standardization: determines whether the features are standardized before training
# classifier__maxIter: controls the maximum number of iterations of the optimizer
param_grid_dict = {
    "classifier__regParam": [0.001, 0.01, 0.1],
    "classifier__standardization": [True, False],
    "classifier__maxIter": [10, 8]
}

In [None]:
# ParamGridBuilder() initializes a builder for parameter grids
param_grid_builder = ParamGridBuilder()
# Iterate over each hyperparameter name and its list of values in the dictionary.
# param.split("__")[1] is the part of the name after "classifier__", e.g. "regParam";
# getattr(lsvc, ...) looks up the Param object with that name on the LinearSVC instance.
# values are the candidate values from param_grid_dict for that parameter.
for param, values in param_grid_dict.items():   
    param_grid_builder = param_grid_builder.addGrid(getattr(lsvc, param.split("__")[1]), values)

# Building the parameter grid using the added grids
param_grid = param_grid_builder.build()
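
The size of the grid explains much of the runtime. The plain-Python sketch below expands the same dictionary into all combinations with `itertools.product`; the helper `expand_grid` is hypothetical and only mirrors what the built grid contains.

```python
from itertools import product


def expand_grid(grid):
    """Expand {name: [values]} into a list of {name: value} dicts, one per combination."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


grid = {
    "regParam": [0.001, 0.01, 0.1],
    "standardization": [True, False],
    "maxIter": [10, 8],
}
combos = expand_grid(grid)
num_folds = 2
print(len(combos))              # 3 * 2 * 2 combinations
print(len(combos) * num_folds)  # model fits during 2-fold cross-validation
```

Each of these fits is itself a one-vs-rest model, so the number of binary classifiers trained is larger again by the number of categories.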

In [None]:
# For evaluation we use the MulticlassClassificationEvaluator, as required by the task.
# It compares the "prediction" column with the "label" column of a transformed DataFrame.
# The F1 score is the harmonic mean of precision and recall.
evaluator=MulticlassClassificationEvaluator(metricName="f1")
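
For a multi-class problem the F1 score has to be aggregated over the classes. To our understanding, the `f1` metric of the evaluator is the per-class F1 averaged with weights proportional to how often each class occurs among the true labels. A plain-Python sketch under that assumption, with the hypothetical name `weighted_f1`:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's true frequency."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (count / total) * f1
    return score


print(weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))
```

With this weighting, frequent categories dominate the score, so a model can reach a high value while still doing poorly on rare categories.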

In [None]:
# Here we initialize the CrossValidator.
# numFolds: the number of folds in cross-validation (2-fold cross-validation)
# seed: fixes the assignment of rows to folds, so that the search is reproducible
cv = CrossValidator(estimator=ovr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=2, seed=4242)

In [None]:
val.groupBy("label").count().show()

In [None]:
# For a "quick" run we fit the CrossValidator on the smaller validation split only and let it pick the best model

cv_model=cv.fit(val)

In [None]:
#Getting our best model to compare to our original one
best_model=cv_model.bestModel

In [None]:
# Calculating the F1 score of the original (baseline) model on the test data

ovr_predictions_test = ovr_model.transform(test)
ovr_f1_score = evaluator.evaluate(ovr_predictions_test)
print(f"OVR F1 Score: {ovr_f1_score}")

In [None]:
# Calculating the F1 score of the best model from the grid search, also on the test data
best_model_predictions_test = best_model.transform(test)
best_model_f1_score = evaluator.evaluate(best_model_predictions_test)
print(f"Best Model F1 Score: {best_model_f1_score}")