# Exercise 2 - Text Processing and Classification using Spark

## Part 2 

Convert the review texts to a classic vector space representation with TFIDF-weighted features based on the Spark DataFrame/Dataset API by building a transformation pipeline. The primary goal of this part is the preparation of the pipeline for Part 3 (see below). Note: although parts of this pipeline will be very similar to Assignment 1 or Part 1 above, do not expect to obtain identical results or have access to all intermediate outputs to compare the individual steps.

Use built-in functions for **tokenization** to unigrams at whitespaces, tabs, digits, and the delimiter characters, as well as for **case folding, stopword removal, TF-IDF calculation, and chi-square selection** (using the 2000 top terms overall). Write the terms selected this way to a file **output_ds.txt** and compare them with the terms selected in Assignment 1. Describe your observations briefly in the submission report (see Part 3).
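
To make the TF-IDF weighting step concrete, here is a minimal pure-Python sketch on a made-up toy corpus. It assumes the smoothed formula documented for Spark ML's `IDF`, `log((m + 1) / (df + 1))`, applied to raw term counts; the helper names `spark_style_idf` and `tfidf` are hypothetical and not part of any library.

```python
import math

def spark_style_idf(num_docs, doc_freq):
    """Smoothed IDF, assuming Spark ML's documented form: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

def tfidf(term_counts, doc_freqs, num_docs):
    """Multiply each raw term count by the IDF of that term."""
    return {
        term: count * spark_style_idf(num_docs, doc_freqs[term])
        for term, count in term_counts.items()
    }

# Toy corpus of three tokenized reviews (made-up data for illustration).
docs = [["hose", "garden", "hose"], ["garden", "tool"], ["album", "song"]]

# Document frequency: in how many documents each term occurs.
doc_freqs = {}
for doc in docs:
    for term in set(doc):
        doc_freqs[term] = doc_freqs.get(term, 0) + 1

# Term counts of the first review.
counts = {"hose": 2, "garden": 1}
weights = tfidf(counts, doc_freqs, len(docs))
print(weights)
```

A term that occurs in fewer documents ("hose") receives a larger weight per occurrence than one that is spread across the corpus ("garden").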

In [1]:
#importing libraries

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, IDF, ChiSqSelector,  StringIndexer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.ml.stat import ChiSquareTest
from pyspark import SparkContext
from pyspark import SparkConf
import json 
from operator import add
import re
from heapq import nlargest

from pyspark.ml.classification import LinearSVC,  OneVsRest
from pyspark.ml.regression import LinearRegression

In [2]:
#starting spark session

spark = SparkSession.builder.getOrCreate()

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
24/05/29 19:16:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/29 19:16:06 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [3]:
#we are using the reviews_devset from the cluster and register it as a temp view. After that we keep only the necessary columns
#note: createOrReplaceTempView returns None, so its result is not assigned to a variable

spark.read.json("hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json").createOrReplaceTempView("review")
df = spark.sql("SELECT category,reviewText FROM review")

                                                                                

In [4]:
df.show()

+--------------------+--------------------+
|            category|          reviewText|
+--------------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|
|Patio_Lawn_and_Garde|This is a very ni...|
|Patio_Lawn_and_Garde|The metal base wi...|
|Patio_Lawn_and_Garde|For the most part...|
|Patio_Lawn_and_Garde|This hose is supp...|
|Patio_Lawn_and_Garde|This tool works v...|
|Patio_Lawn_and_Garde|This product is a...|
|Patio_Lawn_and_Garde|I was excited to ...|
|Patio_Lawn_and_Garde|I purchased the L...|
|Patio_Lawn_and_Garde|Never used a manu...|
|Patio_Lawn_and_Garde|Good price. Good ...|
|Patio_Lawn_and_Garde|I have owned the ...|
|Patio_Lawn_and_Garde|I had "won" a sim...|
|Patio_Lawn_and_Garde|The birds ate all...|
|Patio_Lawn_and_Garde|Bought last summe...|
|Patio_Lawn_and_Garde|I knew I had a mo...|
|Patio_Lawn_and_Garde|I was a little wo...|
|Patio_Lawn_and_Garde|I have used this ...|
|Patio_Lawn_and_Garde|I actually do not...|
|Patio_Lawn_and_Garde|Just what 

In [5]:
#we have uploaded stopwords.txt to our cluster. Here, we load it as a list of strings.

stopwordsPath = "DIC2/stopwords.txt"
stopwords = spark.sparkContext.textFile(stopwordsPath).collect()

In [6]:
#assembling the pipeline. 
#we are using RegexTokenizer, because it's customizable

tokenizer = RegexTokenizer(inputCol="reviewText", outputCol="words", pattern="\\s+|\\d+|[()\\[\\]{}.,;!?:+=\\-_\"'`~#@&*%€$§\\/]+", toLowercase=True)
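#passing the custom list loaded above; without stopWords=... the default English list would be used
remover = StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stopwords, caseSensitive=False)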
vectorizer = CountVectorizer(inputCol="filtered", outputCol="vectorized")
idf = IDF(inputCol="vectorized", outputCol="tfidf")
encoder = StringIndexer(inputCol="category", outputCol="label")
chi2000 = ChiSqSelector(featuresCol="tfidf", labelCol="label", outputCol="selected", numTopFeatures=2000)
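
As a rough illustration of what the tokenizer pattern does, the sketch below applies the same delimiter pattern with Python's `re` module. It is only an approximation: `RegexTokenizer` uses Java regular expressions, which can differ from Python's on non-ASCII whitespace and digits. The `tokenize` helper and the sample sentence are made up for this example.

```python
import re

# Same delimiter pattern as in the RegexTokenizer, written as a raw string.
PATTERN = re.compile(r"\s+|\d+|[()\[\]{}.,;!?:+=\-_\"'`~#@&*%€$§\/]+")

def tokenize(text, min_token_length=1):
    """Lowercase the text, split at delimiters, and drop empty tokens."""
    return [t for t in PATTERN.split(text.lower()) if len(t) >= min_token_length]

# Made-up sample sentence; digits and punctuation act as separators.
tokens = tokenize("Good price. Good hose (50ft), works well!")
print(tokens)
```

Adjacent delimiters (for example `. ` or `(50`) produce empty strings when splitting, which is why the length filter is needed.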

In [7]:
#assembling the pipeline #2

pipeline = Pipeline().setStages([tokenizer, remover, vectorizer, idf, encoder, chi2000])

In [8]:
pipelineModel = pipeline.fit(df)
transformedData = pipelineModel.transform(df)

24/05/29 19:17:00 WARN DAGScheduler: Broadcasting large task binary with size 1060.2 KiB
24/05/29 19:17:13 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/05/29 19:17:14 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/05/29 19:17:24 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
                                                                                

In [9]:
#creating the output. It will contain the top 2000 terms in alphabetical order

selectedFeatures = pipelineModel.stages[5].selectedFeatures
words = pipelineModel.stages[2].vocabulary

output = set()
for i in selectedFeatures:
    output.add(words[i])

sorted_output = sorted(list(output))
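
For intuition about the ranking criterion, here is a simplified sketch of Pearson's chi-squared statistic on a 2x2 term/category table of document counts. This is an assumption-laden illustration, not Spark's implementation: `ChiSqSelector` tests the feature values themselves (treated as categorical) against the label, so its scores will differ from this presence/absence version. The function name and the counts are made up.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared for a 2x2 contingency table of document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # A row or column is empty, so the statistic is undefined; return 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Made-up counts: the term occurs in 30 of 50 in-category documents
# and in only 10 of 150 documents outside the category.
score = chi_square_2x2(30, 10, 20, 140)
print(score)
```

A term distributed independently of the category scores 0, while a term concentrated in one category scores high and is more likely to be selected.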

In [10]:
sorted_output

['access',
 'accessories',
 'account',
 'acoustic',
 'act',
 'acting',
 'action',
 'actions',
 'actor',
 'actors',
 'adapter',
 'addictive',
 'adjust',
 'adjustable',
 'adjustment',
 'admit',
 'adorable',
 'adult',
 'adults',
 'adventure',
 'adventures',
 'advertised',
 'advice',
 'age',
 'ages',
 'agree',
 'air',
 'album',
 'albums',
 'alive',
 'almost',
 'alone',
 'along',
 'alpha',
 'also',
 'although',
 'always',
 'amazing',
 'amazon',
 'america',
 'american',
 'among',
 'amp',
 'amusing',
 'analysis',
 'ancient',
 'android',
 'angle',
 'animals',
 'anime',
 'another',
 'answers',
 'antenna',
 'anyone',
 'apart',
 'app',
 'appeal',
 'apple',
 'applied',
 'apply',
 'applying',
 'appreciate',
 'approach',
 'apps',
 'arm',
 'around',
 'arrived',
 'art',
 'artist',
 'artists',
 'aspects',
 'assemble',
 'assembled',
 'assembly',
 'atmosphere',
 'attach',
 'attached',
 'attention',
 'attractive',
 'audience',
 'audio',
 'author',
 'authors',
 'auto',
 'automatically',
 'available',
 'awa

In [11]:
#writing the terms separated by spaces; the with-block closes the file automatically
with open('output_ds.txt', 'w') as f:
    f.write(" ".join(sorted_output))
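
The comparison with the Assignment 1 terms could be sketched along the following lines. The format of the Assignment 1 output is not shown here, so `load_terms` assumes a whitespace-separated term file, and both helper names are hypothetical; the demo uses two tiny made-up sets instead of real files.

```python
def load_terms(path):
    """Read a whitespace-separated term file into a set (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        return set(f.read().split())

def compare_terms(terms_a, terms_b):
    """Summarize the overlap between two sets of selected terms."""
    shared = terms_a & terms_b
    union = terms_a | terms_b
    return {
        "shared": len(shared),
        "only_a": sorted(terms_a - terms_b),
        "only_b": sorted(terms_b - terms_a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Made-up term sets standing in for the two selections.
report = compare_terms({"album", "hose", "garden"}, {"album", "hose", "song"})
print(report)
```

With real data, the first argument would be `load_terms("output_ds.txt")` and the second the set of terms produced in Assignment 1.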

## Part 3

In this part, you will train a text classifier from the features extracted in Part 2. The goal is to learn a model that can predict the product category from a review's text.

To this end, extend the pipeline from Part 2 such that a Support Vector Machine classifier is trained. Since we are dealing with multi-class problems, make sure to put a strategy in place that allows binary classifiers to be applicable. Apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).

Follow best practices for machine learning experiment design and investigate the effects of parameter settings using the functions provided by Spark:

- Split the review data into training, validation, and test set.

- Make experiments reproducible.

- Use a grid search for parameter optimization:

    - Compare chi square overall top 2000 filtered features with another, heavier filtering with much less dimensionality (see Spark ML documentation for options).

    - Compare different SVM settings by varying the regularization parameter (choose 3 different values), standardization of training features (2 values), and maximum number of iterations (2 values).

- Use the MulticlassClassificationEvaluator to estimate performance of your trained classifiers on the test set, using F1 measure as criterion.
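
To get a feeling for the size of the search space described in the list above, the following pure-Python sketch enumerates all parameter combinations as a cross product, which is also what a grid search evaluates. The value 200 for the heavier feature filtering is an assumption chosen for illustration; the other values are the ones used in the grid further below.

```python
from itertools import product

# Search space: two feature-selection settings and the SVM parameters.
search_space = {
    "numTopFeatures": [2000, 200],  # 200 is an assumed "heavier" filter
    "regParam": [0.001, 0.01, 0.1],
    "standardization": [True, False],
    "maxIter": [10, 8],
}

# Every combination of one value per parameter (the full grid).
names = list(search_space)
combinations = [
    dict(zip(names, values)) for values in product(*search_space.values())
]

print(len(combinations))
print(combinations[0])
```

Each combination is trained once per fold or validation split, so the grid size directly multiplies the training time.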


In [12]:
df=transformedData

In [13]:
#downsampling because the dataframe is too big and we're getting too many warnings about it. Also, the training takes very long. To use the whole dataframe, simply delete this cell

df=df.sample(fraction=0.01, seed=4242)

In [14]:
df2=df.select("label", "selected").toDF("label", "selected")

In [15]:
#as asked in the task we normalize the "selected" column with L2

normalizer = Normalizer().setInputCol("selected").setOutputCol("normalized").setP(2.0)
df_norm =normalizer.transform(df2)
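
What the L2 normalization does to a single feature vector can be sketched in plain Python. The sparse vector is represented as a dict from index to value, and the zero-vector handling (returning it unchanged) is an assumption of this sketch; `l2_normalize` is a hypothetical helper name.

```python
import math

def l2_normalize(sparse_vec):
    """Scale a sparse vector (index -> value) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in sparse_vec.values()))
    if norm == 0.0:
        # Assumption: an all-zero vector is returned unchanged.
        return dict(sparse_vec)
    return {i: v / norm for i, v in sparse_vec.items()}

# Made-up vector with two non-zero TF-IDF weights; its L2 norm is 5.
vec = {0: 3.0, 5: 4.0}
unit = l2_normalize(vec)
print(unit)
```

After normalization, long and short reviews have feature vectors of the same length, so the classifier sees term proportions rather than absolute magnitudes.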

In [16]:
#deleting unnecessary columns

df3=df_norm.select("label", "normalized").toDF("label", "normalized")

In [17]:
df3.show()

24/05/29 19:17:53 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


+-----+--------------------+
|label|          normalized|
+-----+--------------------+
| 18.0|(2000,[0,2,5,18,8...|
| 18.0|(2000,[3,4,7,26,3...|
| 18.0|(2000,[7,34,58,68...|
| 18.0|(2000,[3,9,18,39,...|
| 18.0|(2000,[3,6,18,37,...|
| 18.0|(2000,[1,4,14,27,...|
| 18.0|(2000,[18,23,31,3...|
| 18.0|(2000,[27,52,56,6...|
| 18.0|(2000,[1,2,6,21,2...|
| 10.0|(2000,[1,3,9,28,3...|
| 10.0|(2000,[2,7,13,17,...|
| 10.0|(2000,[7,11,14,65...|
| 10.0|(2000,[25,29,31,7...|
| 10.0|(2000,[7,22,70,14...|
| 10.0|(2000,[10,14,36,8...|
| 10.0|(2000,[2,10,36],[...|
| 10.0|(2000,[6,124,155,...|
| 10.0|(2000,[3,17,30,97...|
| 10.0|(2000,[23,96,116,...|
| 10.0|(2000,[9,12,65,13...|
+-----+--------------------+
only showing top 20 rows



In [18]:
#splitting the data into training, validation, and test set; the fixed seed makes the split reproducible
train, val, test = df3.randomSplit([0.8, 0.1, 0.1], seed=4242)

In [None]:
#setting up the classification. We use OneVsRest to be able to use the binary LinearSVC for multiclass labeling

lsvc = LinearSVC(featuresCol="normalized", labelCol="label", maxIter=10)
ovr = OneVsRest(classifier=lsvc, featuresCol="normalized", labelCol="label")
ovr_model = ovr.fit(train)

24/05/29 19:17:54 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
24/05/29 19:17:57 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
24/05/29 19:17:58 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
24/05/29 19:17:59 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/05/29 19:17:59 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
24/05/29 19:17:59 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/05/29 19:17:59 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
24/05/29 19:18:00 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
24/05/29 19:18:00 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
24/05/29 19:18:00 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
24/05/29 19:18:00 WARN DAGS
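
The one-vs-rest strategy used above can be illustrated with a small pure-Python sketch: one linear binary model per class, and the predicted label is the class whose model yields the highest score. The weights, intercepts, and the helper name are made up for illustration and are not taken from the fitted model.

```python
def one_vs_rest_predict(features, binary_models):
    """Return the label whose binary linear model gives the highest margin.

    features: sparse vector as dict index -> value
    binary_models: dict label -> (weights dict, intercept)
    """
    def margin(weights, intercept):
        return sum(weights.get(i, 0.0) * v for i, v in features.items()) + intercept

    return max(binary_models, key=lambda label: margin(*binary_models[label]))

# Three made-up binary models, one per class label.
models = {
    0.0: ({0: 1.0, 1: -1.0}, 0.0),
    1.0: ({0: -1.0, 1: 1.0}, 0.0),
    2.0: ({2: 1.0}, -0.5),
}

# Margins for this vector: class 0.0 -> -0.8, class 1.0 -> 0.8, class 2.0 -> 0.1
predicted = one_vs_rest_predict({1: 0.8, 2: 0.6}, models)
print(predicted)
```

With many product categories, this means one binary SVM is trained per category, which is part of why training is slow.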

In [None]:
#setting up the parameter grid for the grid search. CrossValidator expects a list of param maps, so we build it with ParamGridBuilder
#we tried to keep the iteration count low, as it's already too slow.

grid = (ParamGridBuilder()
    .addGrid(lsvc.regParam, [0.001, 0.01, 0.1])
    .addGrid(lsvc.standardization, [True, False])
    .addGrid(lsvc.maxIter, [10, 8])
    .build())

In [None]:
#for evaluation we use the recommended MulticlassClassificationEvaluator

evaluator=MulticlassClassificationEvaluator(metricName="f1")
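
The `f1` metric of the evaluator is a weighted F-measure: the per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The following pure-Python sketch reproduces that idea on made-up labels; `weighted_f1` is a hypothetical helper, not a Spark function.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's label share."""
    support = Counter(labels)
    total = len(labels)
    pairs = list(zip(labels, predictions))
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score

# Made-up example with three classes and one misclassified item.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 0, 1, 2, 2, 2]
f1_score = weighted_f1(labels, predictions)
print(f1_score)
```

Because of the weighting, frequent categories influence the score more than rare ones.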

In [None]:
#here we initialize the crossvalidator; the seed makes the fold assignment reproducible

cv=CrossValidator(estimator=ovr, estimatorParamMaps=grid, evaluator=evaluator, numFolds=2, seed=4242)

In [None]:
#for a "quick" run we fit the crossvalidator only on the (smaller) validation dataset to find the best model

cv_model=cv.fit(val)

In [None]:
#getting our best model to compare to our original one

best_model=cv_model.bestModel

In [None]:
#calculating the original model's f1 score. we're using the test data

ovr_predictions_test = ovr_model.transform(test)
ovr_f1_score = evaluator.evaluate(ovr_predictions_test)
print(f"OVR F1 Score: {ovr_f1_score}")

In [None]:
#calculating the best model's f1 score. we're using the test data


best_model_predictions_test = best_model.transform(test)
best_model_f1_score = evaluator.evaluate(best_model_predictions_test)
print(f"Best Model F1 Score: {best_model_f1_score}")