# Exercise 2: Text Processing and Classification using Spark
### Data-intensive Computing 2023, Group 15

## Part 1) RDDs

Repeat the steps of Assignment 1, i.e. calculation of chi-square values and output of the sorted top terms per category, as well as the joined dictionary, using RDDs and transformations. Write the output to a file output_rdd.txt. Compare the generated output_rdd.txt with your generated output.txt from Assignment 1 and describe your observations briefly in the submission report (see Part 3).

### Libraries

In [1]:
from pyspark import SparkContext
from pyspark import SparkConf
import json 
from operator import add
import re
from heapq import nlargest

In [2]:
sc = SparkContext.getOrCreate(SparkConf())

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/05/29 11:58:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/29 11:58:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/05/29 11:59:00 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


### Data loading

Note that we could have also loaded the nltk library and downloaded the stopwords. Since this takes time, and efficiency is part of the solution, we have decided to load them directly:
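Since every token is later checked against this list, the lookup cost matters. A minimal sketch, assuming a much shorter hypothetical list `stopwords_sample`, of how the list could be turned into a `frozenset` so that repeated entries collapse and membership tests take constant time on average:

```python
# Hypothetical excerpt of the stopword list; the real list is much longer.
stopwords_sample = ["a", "about", "album", "album", "the", "zero"]

# A frozenset removes duplicates and gives average O(1) membership tests.
stopword_set = frozenset(stopwords_sample)

tokens = ["the", "album", "sounds", "great"]
kept = [t for t in tokens if t not in stopword_set and len(t) > 1]
print(kept)
```

The filtering logic stays the same as with a list; only the container changes.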

In [3]:
stopwords = ["a" ,"aa" ,"able" ,"about" ,"above" ,"absorbs" ,"accord" ,"according" ,"accordingly" ,"across" ,"actually" ,"after" ,"afterwards" ,"again" ,"against" ,"ain" ,"album" ,"album" ,"all" ,"allow" ,"allows" ,"almost" ,"alone" ,"along" ,"already" ,"also" ,"although" ,"always" ,"am" ,"among" ,"amongst" ,"an" ,"and" ,"another" ,"any" ,"anybody" ,"anyhow" ,"anyone" ,"anything" ,"anyway" ,"anyways" ,"anywhere" ,"apart" ,"app" ,"appear" ,"appreciate" ,"appropriate" ,"are" ,"aren" ,"around" ,"as" ,"aside" ,"ask" ,"asking" ,"associated" ,"at" ,"available" ,"away" ,"awfully" ,"b" ,"baby" ,"bb" ,"be" ,"became" ,"because" ,"become" ,"becomes" ,"becoming" ,"been" ,"before" ,"beforehand" ,"behind" ,"being" ,"believe" ,"below" ,"beside" ,"besides" ,"best" ,"better" ,"between" ,"beyond" ,"bibs" ,"bike" ,"book" ,"books" ,"both" ,"brief" ,"bulbs" ,"but" ,"by" ,"c" ,"came" ,"camera" ,"can" ,"cannot" ,"cant" ,"car" ,"case" ,"cause" ,"causes" ,"cd" ,"certain" ,"certainly" ,"changes" ,"clearly" ,"co" ,"coffee" ,"com" ,"come" ,"comes" ,"concerning" ,"consequently" ,"consider" ,"considering" ,"contain" ,"containing" ,"contains" ,"corresponding" ,"could" ,"couldn" ,"course" ,"currently" ,"d" ,"definitely" ,"described" ,"despite" ,"did" ,"didn" ,"different" ,"do" ,"does" ,"doesn" ,"dog" ,"dogs" ,"doing" ,"doll" ,"don" ,"done" ,"down" ,"downwards" ,"during" ,"e" ,"each" ,"edu" ,"eg" ,"eight" ,"either" ,"else" ,"elsewhere" ,"enough" ,"entirely" ,"especially" ,"et" ,"etc" ,"even" ,"ever" ,"every" ,"everybody" ,"everyone" ,"everything" ,"everywhere" ,"ex" ,"exactly" ,"example" ,"except" ,"f" ,"far" ,"few" ,"fifth" ,"film" ,"first" ,"five" ,"flavor" ,"followed" ,"following" ,"follows" ,"for" ,"former" ,"formerly" ,"forth" ,"four" ,"from" ,"fun" ,"further" ,"furthermore" ,"g" ,"game" ,"game" ,"get" ,"gets" ,"getting" ,"given" ,"gives" ,"go" ,"goes" ,"going" ,"gone" ,"got" ,"gotten" ,"greetings" ,"grill" ,"guitar" ,"h" ,"had" ,"hadn" ,"hair" ,"happens" ,"hardly" ,"has" ,"hasn" ,"have" 
,"haven" ,"having" ,"he" ,"hello" ,"help" ,"hence" ,"her" ,"here" ,"hereafter" ,"hereby" ,"herein" ,"hereupon" ,"hers" ,"herself" ,"hi" ,"him" ,"himself" ,"his" ,"hither" ,"hopefully" ,"how" ,"howbeit" ,"however" ,"i" ,"ie" ,"if" ,"ignored" ,"immediate" ,"in" ,"inasmuch" ,"inc" ,"indeed" ,"indicate" ,"indicated" ,"indicates" ,"ink" ,"inner" ,"insofar" ,"install" ,"instead" ,"into" ,"inward" ,"is" ,"isn" ,"it" ,"its" ,"itself" ,"j" ,"just" ,"k" ,"keep" ,"keeps" ,"kept" ,"kitchen" ,"knife" ,"know" ,"known" ,"knows" ,"l" ,"lamp" ,"laptop" ,"last" ,"lately" ,"later" ,"latter" ,"latterly" ,"least" ,"less" ,"lest" ,"let" ,"life" ,"like" ,"liked" ,"likely" ,"little" ,"ll" ,"look" ,"looking" ,"looks" ,"ltd" ,"m" ,"mainly" ,"many" ,"may" ,"maybe" ,"me" ,"mean" ,"meanwhile" ,"merely" ,"might" ,"mon" ,"more" ,"moreover" ,"most" ,"mostly" ,"movie" ,"mower" ,"much" ,"must" ,"my" ,"myself" ,"n" ,"name" ,"namely" ,"nd" ,"near" ,"nearly" ,"necessary" ,"need" ,"needs" ,"neither" ,"never" ,"nevertheless" ,"new" ,"next" ,"nine" ,"no" ,"nobody" ,"non" ,"none" ,"noone" ,"nor" ,"normally" ,"not" ,"nothing" ,"novel" ,"now" ,"nowhere" ,"o" ,"obviously" ,"of" ,"off" ,"often" ,"oh" ,"ok" ,"okay" ,"old" ,"on" ,"once" ,"one" ,"ones" ,"only" ,"onto" ,"or" ,"other" ,"others" ,"otherwise" ,"ought" ,"our" ,"ours" ,"ourselves" ,"out" ,"outside" ,"over" ,"overall" ,"own" ,"p" ,"particular" ,"particularly" ,"per" ,"perhaps" ,"phone" ,"placed" ,"please" ,"plus" ,"possible" ,"presumably" ,"printer" ,"probably" ,"product" ,"provides" ,"q" ,"que" ,"quite" ,"qv" ,"r" ,"rather" ,"rd" ,"re" ,"read" ,"read" ,"really" ,"reasonably" ,"regarding" ,"regardless" ,"regards" ,"relatively" ,"respectively" ,"right" ,"s" ,"said" ,"same" ,"saw" ,"say" ,"saying" ,"says" ,"second" ,"secondly" ,"see" ,"seeing" ,"seem" ,"seemed" ,"seeming" ,"seems" ,"seen" ,"self" ,"selves" ,"sensible" ,"sent" ,"serious" ,"seriously" ,"seven" ,"several" ,"shall" ,"shave" ,"she" ,"shoes" ,"should" ,"shouldn" ,"since" ,"six" ,"skin" ,"so" 
,"some" ,"somebody" ,"somehow" ,"someone" ,"something" ,"sometime" ,"sometimes" ,"somewhat" ,"somewhere" ,"song" ,"songs" ,"soon" ,"sorry" ,"specified" ,"specify" ,"specifying" ,"still" ,"story" ,"strings" ,"stroller" ,"sub" ,"such" ,"sup" ,"sure" ,"t" ,"take" ,"taken" ,"taste" ,"tell" ,"tends" ,"th" ,"than" ,"thank" ,"thanks" ,"thanx" ,"that" ,"that" ,"thats" ,"the" ,"their" ,"theirs" ,"them" ,"themselves" ,"then" ,"thence" ,"there" ,"there" ,"thereafter" ,"thereby" ,"therefore" ,"therein" ,"theres" ,"thereupon" ,"these" ,"they" ,"think" ,"third" ,"this" ,"thorough" ,"thoroughly" ,"those" ,"though" ,"three" ,"through" ,"throughout" ,"thru" ,"thus" ,"to" ,"together" ,"too" ,"took" ,"toward" ,"towards" ,"toy" ,"tried" ,"tries" ,"truck" ,"truly" ,"try" ,"trying" ,"twice" ,"two" ,"u" ,"un" ,"under" ,"unfortunately" ,"unless" ,"unlikely" ,"until" ,"unto" ,"up" ,"upon" ,"us" ,"use" ,"used" ,"useful" ,"uses" ,"using" ,"usually" ,"v" ,"value" ,"various" ,"ve" ,"very" ,"via" ,"viz" ,"vs" ,"want" ,"wants" ,"was" ,"wasn" ,"way" ,"we" ,"wear" ,"welcome" ,"well" ,"went" ,"were" ,"weren" ,"what" ,"whatever" ,"when" ,"whence" ,"whenever" ,"where" ,"whereafter" ,"whereas" ,"whereby" ,"wherein" ,"whereupon" ,"wherever" ,"whether" ,"which" ,"while" ,"whither" ,"who" ,"whoever" ,"whole" ,"whom" ,"whose" ,"why" ,"will" ,"willing" ,"wish" ,"with" ,"within" ,"without" ,"won" ,"wonder" ,"would" ,"wouldn" ,"x" ,"y" ,"yes" ,"yet" ,"you" ,"your" ,"yours" ,"yourself" ,"yourselves" ,"z" ,"zero"]

In [4]:
rddfile = sc.textFile("hdfs:///user/dic23_shared/amazon-reviews/full/reviews_devset.json")

In [5]:
rddfile.first()

                                                                                

'{"reviewerID": "A2VNYWOPJ13AFP", "asin": "0981850006", "reviewerName": "Amazon Customer \\"carringt0n\\"", "helpful": [6, 7], "reviewText": "This was a gift for my other husband.  He\'s making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You\'ll love the food and it has provided us with an insight into the culture that produced it. It\'s all about broadening horizons.  Yum!!", "overall": 5.0, "summary": "Delish", "unixReviewTime": 1259798400, "reviewTime": "12 3, 2009", "category": "Patio_Lawn_and_Garde"}'

### Calculations

##### Total number of documents:

In [6]:
n_docs = rddfile.count()
n_docs

                                                                                

78829

##### Total number of documents for each category:

In [7]:
n_category = rddfile.map(lambda x: (json.loads(x)['category'],1)).reduceByKey(add)
n_category.collect()

                                                                                

[('Apps_for_Android', 2638),
 ('Book', 22507),
 ('Toys_and_Game', 2253),
 ('Office_Product', 1243),
 ('Digital_Music', 836),
 ('Automotive', 1374),
 ('Beauty', 2023),
 ('Kindle_Store', 3205),
 ('Electronic', 7825),
 ('Movies_and_TV', 4607),
 ('Tools_and_Home_Improvement', 1926),
 ('Grocery_and_Gourmet_Food', 1297),
 ('Patio_Lawn_and_Garde', 994),
 ('Sports_and_Outdoor', 3269),
 ('Musical_Instrument', 500),
 ('CDs_and_Vinyl', 3749),
 ('Clothing_Shoes_and_Jewelry', 5749),
 ('Home_and_Kitche', 4254),
 ('Cell_Phones_and_Accessorie', 3447),
 ('Pet_Supplie', 1235),
 ('Baby', 916),
 ('Health_and_Personal_Care', 2982)]

##### Tokenization:

In [8]:
def term_category(y):
    x = json.loads(y)
    catX = x["category"]
    revX = x["reviewText"]
    # casefold, split at the delimiters and keep each term once per document;
    # the hyphen is escaped so it is a literal character, not a range
    output = set(re.split(r"""[\s\d()\[\]{}.!?,;:+=\-_"'`~#@&*%€$§\\/]""", revX.casefold()))
    # drop stopwords and tokens that are empty or a single character
    output = [i for i in output if i not in stopwords and len(i)>1]
    # emit ((term, category), 1) so the pairs can be summed with reduceByKey
    return [((term, catX), 1) for term in output]
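To see what the tokenization step emits for a single record, here is a standalone sketch that runs without Spark. The review line, the small `stopword_set`, and the helper name `term_category_sketch` are hypothetical and only serve as illustration:

```python
import json
import re

# Delimiters: whitespace, digits and the characters ()[]{}.!?,;:+=-_"'`~#@&*%€$§\/
# The hyphen is escaped so it is read literally rather than as a range.
DELIMITERS = re.compile(r"""[\s\d()\[\]{}.!?,;:+=\-_"'`~#@&*%€$§\\/]""")

# Hypothetical stopword subset and review record, for illustration only.
stopword_set = frozenset(["this", "was", "a", "for", "my"])
line = json.dumps({"category": "Book",
                   "reviewText": "This was a gift for my husband; 2 thumbs-up!"})

def term_category_sketch(raw_line):
    record = json.loads(raw_line)
    # A set keeps every term at most once per document.
    tokens = set(DELIMITERS.split(record["reviewText"].casefold()))
    return sorted(
        ((term, record["category"]), 1)
        for term in tokens
        if len(term) > 1 and term not in stopword_set
    )

result = term_category_sketch(line)
print(result)
```

Each surviving term appears once per document, so summing the emitted ones per key yields document counts rather than raw term frequencies.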

##### Total number of documents containing a term for each category (this is our A):

In [9]:
n_term_category = rddfile.flatMap(term_category).reduceByKey(add)
n_term_category.take(5)

                                                                                

[(('mic', 'Musical_Instrument'), 24),
 (('setups', 'Musical_Instrument'), 1),
 (('tape', 'Musical_Instrument'), 4),
 (('clip', 'Musical_Instrument'), 5),
 (('altering', 'Musical_Instrument'), 1)]

##### Total number of documents containing a term:

In [10]:
n_term = n_term_category.map(lambda x: (x[0][0],x[1])).reduceByKey(add)
n_term.take(10)

                                                                                

[('scripture', 115),
 ('verse', 73),
 ('tremendously', 58),
 ('chapter', 849),
 ('friendly', 310),
 ('panel', 110),
 ('lock', 249),
 ('fix', 424),
 ('picked', 562),
 ('free', 1901)]

##### Join n_term, n_category and n_term_category

In [11]:
n_term_category_key = n_term_category.map(lambda x: (x[0][0],(x[0][1],x[1])))
first_join = n_term_category_key.join(n_term)
first_join.take(5)

                                                                                

[('items', (('Musical_Instrument', 5), 691)),
 ('items', (('CDs_and_Vinyl', 4), 691)),
 ('items', (('Clothing_Shoes_and_Jewelry', 52), 691)),
 ('items', (('Home_and_Kitche', 88), 691)),
 ('items', (('Cell_Phones_and_Accessorie', 35), 691))]

In [12]:
first_join_key = first_join.map(lambda x: (x[1][0][0],((x[1][0][1],x[0]),x[1][1])))
second_join = first_join_key.join(n_category)
second_join.take(5)

                                                                                

[('Toys_and_Game', (((44, 'items'), 691), 2253)),
 ('Toys_and_Game', (((6, 'length'), 637), 2253)),
 ('Toys_and_Game', (((34, 'kit'), 365), 2253)),
 ('Toys_and_Game', (((28, 'quick'), 1432), 2253)),
 ('Toys_and_Game', (((315, 'good'), 16025), 2253))]

##### Calculating chisquared values

Reminder: input is in the form of (category, (((n_term_category, term), n_term), n_category_category))
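Written out, the statistic computed in the next cell is the following, where $N$ is `n_docs`, $A$ is the number of documents in category $c$ that contain term $t$, $B$ the number of documents outside $c$ that contain $t$, $C$ the number of documents in $c$ without $t$, and $D$ the number of documents outside $c$ without $t$:

```latex
$$
\chi^2_{t,c} = \frac{N\,(AD - BC)^2}{(A+B)\,(A+C)\,(B+D)\,(C+D)},
\qquad N = A + B + C + D
$$
```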

In [13]:
def chisquare(thejoinedattributes):
    
    category = thejoinedattributes[0]
    term = thejoinedattributes[1][0][0][1]
    
    n_term_category = thejoinedattributes[1][0][0][0]
    n_term = thejoinedattributes[1][0][1]
    n_category_category = thejoinedattributes[1][1]
    
    A = n_term_category
    B = n_term - A
    C = n_category_category - A
    D = n_docs - n_category_category - B
    
    AD = A*D
    BC = B*C
    AB = A+B
    AC = A+C
    BD = B+D
    CD = C+D

    numerator = n_docs*(AD-BC)**2
    denominator = AB*AC*BD*CD
    # use a separate name so the function chisquare is not shadowed
    chi2 = numerator/denominator
    
    return (category, (term, chi2))
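As a sanity check of the formula, the following standalone sketch evaluates it on two small, made-up contingency tables. The helper name `chi_square` and the counts are assumptions for illustration, not values from the data set:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return numerator / denominator

# The term is equally frequent inside and outside the category: no association.
independent = chi_square(10, 20, 30, 60)
# The term occurs in every document of the category and nowhere else.
perfect = chi_square(10, 0, 0, 10)
print(independent, perfect)
```

A term that is spread proportionally over all categories scores zero, while a term that identifies the category perfectly scores as high as the total document count.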

##### Applying the calculations

Reminder: we need to calculate all chi-square values, keep the 75 highest-scoring terms of each category in descending order, concatenate them into one line per category, and sort the lines by category.

In [14]:
chisquared = second_join.map(chisquare).sortBy(lambda x: x[1][1], ascending=False).groupBy(lambda x: x[0]).flatMap(lambda g: nlargest(75, g[1], key=lambda x: x[1][1])).reduceByKey(add).sortByKey(True).collect()
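The per-category selection can be illustrated without Spark. This is a minimal sketch on hypothetical `(category, (term, chi_square))` records. Because `nlargest` returns its result already sorted in descending order regardless of input order, it suggests that a global sort before grouping may not be needed for correctness:

```python
from collections import defaultdict
from heapq import nlargest

# Hypothetical (category, (term, chi_square)) records.
scored = [
    ("Book", ("novel", 9.0)),
    ("Book", ("chapter", 12.5)),
    ("Book", ("plot", 3.0)),
    ("Baby", ("diaper", 20.0)),
    ("Baby", ("crib", 7.5)),
]

def top_terms_per_category(records, k):
    grouped = defaultdict(list)
    for category, term_score in records:
        grouped[category].append(term_score)
    # nlargest returns the k best entries already sorted in descending order.
    return {
        category: nlargest(k, pairs, key=lambda pair: pair[1])
        for category, pairs in sorted(grouped.items())
    }

top2 = top_terms_per_category(scored, 2)
print(top2)
```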

                                                                                

### Saving to file

##### Generating the dictionary for the last line

In [15]:
lastline = set()

for i in chisquared:
    term_chi_array = i[1]
    for u in term_chi_array:
        if type(u) == str:
            lastline.add(u)

last_output = sorted(list(lastline))

##### Writing the chisquare results and the last line dictionary to output_rdd.txt

In [16]:
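with open('output_rdd.txt', 'w') as f:
    for i in chisquared:
        # strip tuple/list punctuation so each line reads "category term value term value ..."
        f.write(re.sub(r"\(|\)|,|'|\[|\]", "", str(i)))
        f.write('\n')
    # last line: the merged, alphabetically sorted dictionary of all selected terms
    f.write(re.sub(r",|'|[0-9]|\[|\]|\.", "", str(last_output)))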
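The regular expression above removes punctuation from the string representation of each tuple. As a possible alternative, the same line can be built directly with `join`. The sketch below uses a hypothetical entry in the shape produced by the cells above and checks that both approaches agree:

```python
import re

# Hypothetical entry in the shape (category, (term, chi, term, chi, ...)).
entry = ("Baby", ("diaper", 20.0, "crib", 7.5))

# Approach used in the notebook: stringify, then strip the punctuation.
regex_line = re.sub(r"\(|\)|,|'|\[|\]", "", str(entry))
# Alternative: join the category and the flattened values with spaces.
joined_line = " ".join([entry[0], *map(str, entry[1])])
print(joined_line)
```

Building the line explicitly avoids accidentally deleting characters that belong to a term.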

## Part 2) Datasets/DataFrames: Spark ML and Pipelines

Convert the review texts to a classic vector space representation with TFIDF-weighted features based on the Spark DataFrame/Dataset API by building a transformation pipeline. The primary goal of this part is the preparation of the pipeline for Part 3 (see below). Note: although parts of this pipeline will be very similar to Assignment 1 or Part 1 above, do not expect to obtain identical results or have access to all intermediate outputs to compare the individual steps.

Use built-in functions for tokenization to unigrams at whitespaces, tabs, digits, and the delimiter characters ()[]{}.!?,;:+=-_"'`~#@&*%€$§\/, casefolding, stopword removal, TF-IDF calculation, and chi square selection (using 2000 top terms overall). Write the terms selected this way to a file output_ds.txt and compare them with the terms selected in Assignment 1. Describe your observations briefly in the submission report (see Part 3).

### Libraries

In [17]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF, Tokenizer, RegexTokenizer, StringIndexer, ChiSqSelector
from pyspark.ml import Pipeline
import re

In [18]:
spark = SparkSession.builder.getOrCreate()

### Building a Pipeline

##### Define the parts of the Pipeline

In [19]:
# read in the reviews json file from the hdfs directory and register it as a temporary view
# (createOrReplaceTempView returns None, so its result is not assigned)
spark.read.json("hdfs:///user/dic23_shared/amazon-reviews/full/reviews_devset.json").createOrReplaceTempView("review")
# create a dataframe with the columns category and reviewText
reviewTextDF = spark.sql("SELECT category,reviewText FROM review")

# Case folding and tokenization using whitespaces, tabs, digits
# and the characters ()[]{}.!?,;:+=-_"'`~#@&*%€$§\/ as delimiters
regexTokenizer = RegexTokenizer().setInputCol("reviewText").setOutputCol("terms").setPattern("[\\s_\\*+-:,;+=/\\\\%€[0-9]$&§@#~`\"'\\.()\\t\\[\\]{}!?]")

# stopword removal
remover = StopWordsRemover().setInputCol("terms").setOutputCol("filteredTerms")

# limit the vocabulary size to improve runtime; .setMinDF(5) could additionally enforce a minimum document frequency
countVec = CountVectorizer().setInputCol("filteredTerms").setOutputCol("features").setVocabSize(20000) 

# TF-IDF calculation
idf = IDF().setInputCol("features").setOutputCol("weightedFeatures")

# encode the categories as labels
encoder = StringIndexer().setInputCol("category").setOutputCol("categoryIndex")

# chi square selection (using 2000 top terms overall)
selector = ChiSqSelector().setNumTopFeatures(2000).setFeaturesCol("weightedFeatures").setLabelCol("categoryIndex").setOutputCol("selectedFeatures")
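To make the weighting step concrete, here is a pure-Python sketch of TF-IDF on three hypothetical token lists. It uses raw term counts as TF and the smoothed inverse document frequency log((N + 1) / (df + 1)) described in the Spark ML feature documentation. The function name `tf_idf` and the documents are assumptions for illustration:

```python
import math
from collections import Counter

# Hypothetical tokenized documents after stopword removal.
docs = [["gift", "husband", "food"], ["food", "grill", "food"], ["gift", "book"]]

def tf_idf(documents):
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF: log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]

weights = tf_idf(docs)
print(weights[1])
```

Terms that occur in many documents receive a small weight even when they are frequent within one review.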

                                                                                

##### Stack together to a pipeline, fit and transform the model

In [20]:
pipeline = Pipeline().setStages([regexTokenizer, remover, countVec, idf, encoder, selector])

# fit the pipeline to training documents.
model = pipeline.fit(reviewTextDF)

finalDF = model.transform(reviewTextDF)
finalDF.show()

                                                                                

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|            category|          reviewText|               terms|       filteredTerms|            features|    weightedFeatures|categoryIndex|    selectedFeatures|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|[this, was, a, gi...|[gift, husband, m...|(20000,[5,7,8,9,2...|(20000,[5,7,8,9,2...|         18.0|(2000,[5,7,8,9,27...|
|Patio_Lawn_and_Garde|This is a very ni...|[this, is, a, ver...|[nice, spreader, ...|(20000,[2,4,8,9,1...|(20000,[2,4,8,9,1...|         18.0|(2000,[2,4,8,9,16...|
|Patio_Lawn_and_Garde|The metal base wi...|[the, metal, base...|[metal, base, hos...|(20000,[6,20,21,3...|(20000,[6,20,21,3...|         18.0|(2000,[6,19,20,37...|
|Patio_Lawn_and_Garde|

### Get the distinct terms and sort them

In [21]:
selectedFeatures = model.stages[5].selectedFeatures
vocab = model.stages[2].vocabulary

output = set()
for i in selectedFeatures:
    output.add(vocab[i])

sorted_output = sorted(list(output))

### Save to output_ds.txt

In [22]:
with open('output_ds.txt', 'w') as f:
    # write the selected terms as one space-separated line
    f.write(re.sub(r",|'|[0-9]|\[|\]|\.", "", str(sorted_output)))

## Part 3) Text Classification

In this part, you will train a text classifier from the features extracted in Part 2. The goal is to learn a model that can predict the product category from a review's text.

To this end, extend the pipeline from Part 2 such that a Support Vector Machine classifier is trained. Since we are dealing with multi-class problems, make sure to put a strategy in place that allows binary classifiers to be applicable. Apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).
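Vector length normalization with the L2 norm divides every component by the Euclidean length of the vector, so that long and short reviews become comparable. A minimal sketch on a hypothetical sparse vector stored as a dictionary follows. Leaving an all-zero vector unchanged is an assumption of this sketch:

```python
import math

def l2_normalize(vector):
    """Scale a sparse vector (index -> value) to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector.values()))
    if norm == 0.0:
        return dict(vector)  # assumption: an all-zero vector is left unchanged
    return {index: value / norm for index, value in vector.items()}

unit = l2_normalize({3: 3.0, 7: 4.0})
print(unit)
```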

Follow best practices for machine learning experiment design and investigate the effects of parameter settings using the functions provided by Spark:

Split the review data into training, validation, and test set.

Make experiments reproducible.

Use a grid search for parameter optimization:

Compare chi square overall top 2000 filtered features with another, heavier filtering with much less dimensionality (see Spark ML documentation for options).

Compare different SVM settings by varying the regularization parameter (choose 3 different values), standardization of training features (2 values), and maximum number of iterations (2 values).

Use the MulticlassClassificationEvaluator to estimate performance of your trained classifiers on the test set, using F1 measure as criterion.

### Libraries

In [78]:
from pyspark.ml.linalg import Matrices
from pyspark.mllib.util import MLUtils
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


In [24]:
# create a Dataframe with the relevant columns
mod = finalDF.select("categoryIndex","selectedFeatures").toDF("label","features")

### Normalize

In [25]:
# apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).
normalizer = Normalizer().setInputCol("features").setOutputCol("normFeatures").setP(2.0)
l2NormData = normalizer.transform(mod)

print("Normalized using L^2 norm")
l2NormData.show()

Normalized using L^2 norm
+-----+--------------------+--------------------+
|label|            features|        normFeatures|
+-----+--------------------+--------------------+
| 18.0|(2000,[5,7,8,9,27...|(2000,[5,7,8,9,27...|
| 18.0|(2000,[2,4,8,9,16...|(2000,[2,4,8,9,16...|
| 18.0|(2000,[6,19,20,37...|(2000,[6,19,20,37...|
| 18.0|(2000,[1,3,4,8,14...|(2000,[1,3,4,8,14...|
| 18.0|(2000,[1,36,41,43...|(2000,[1,36,41,43...|
| 18.0|(2000,[2,6,8,13,1...|(2000,[2,6,8,13,1...|
| 18.0|(2000,[6,14,18,49...|(2000,[6,14,18,49...|
| 18.0|(2000,[24,26,32,5...|(2000,[24,26,32,5...|
| 18.0|(2000,[8,11,20,23...|(2000,[8,11,20,23...|
| 18.0|(2000,[16,24,30,4...|(2000,[16,24,30,4...|
| 18.0|(2000,[1,4,5,17,1...|(2000,[1,4,5,17,1...|
| 18.0|(2000,[9,18,23,45...|(2000,[9,18,23,45...|
| 18.0|(2000,[4,11,14,18...|(2000,[4,11,14,18...|
| 18.0|(2000,[3,69,94,98...|(2000,[3,69,94,98...|
| 18.0|(2000,[1,2,8,9,13...|(2000,[1,2,8,9,13...|
| 18.0|(2000,[1,6,14,16,...|(2000,[1,6,14,16,...|
| 18.0|(2000,[1,4,6,17,2

In [26]:
mod = l2NormData.select("label","normFeatures").toDF("label","features")

# convert DataFrame columns and save as libsvm.
convertedVecDF = MLUtils.convertVectorColumnsToML(mod)
convertedVecDF.write.mode("overwrite").format("libsvm").save("data/Part3")

                                                                                

In [27]:
# load data file.
inputData = spark.read.format("libsvm").load("data/Part3")

23/05/29 12:01:20 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
                                                                                

In [28]:
# hold out 20% of the data as test set, then split the remainder 80/20 into training and validation set.
# a fixed seed makes the experiments reproducible.
(train_validation,test) = inputData.randomSplit([0.8, 0.2], seed = 42)
(train, validation) = train_validation.randomSplit([0.8, 0.2], seed = 42)
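Splitting twice with weights 0.8 and 0.2 results in roughly 64% training, 16% validation, and 20% test data. The sketch below reproduces these proportions on 100 hypothetical record ids with the standard library. Unlike this exact slicing, Spark's `randomSplit` assigns rows at random, so the sizes there are only approximate:

```python
import random

def three_way_split(items, seed=42):
    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    # First split: 80% training + validation, 20% test.
    test_start = len(shuffled) * 8 // 10
    train_validation, test = shuffled[:test_start], shuffled[test_start:]
    # Second split: 80% of the remainder for training, 20% for validation.
    validation_start = len(train_validation) * 8 // 10
    train = train_validation[:validation_start]
    validation = train_validation[validation_start:]
    return train, validation, test

train_ids, validation_ids, test_ids = three_way_split(list(range(100)))
print(len(train_ids), len(validation_ids), len(test_ids))
```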

In [29]:
# train a Support Vector Machine classifier.
lsvc = LinearSVC(maxIter=10, regParam=0.1)

# strategy that allows binary classifiers to be applicable: instantiate the One Vs Rest Classifier.
OVRclassifier = OneVsRest(classifier=lsvc)
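The idea behind the one-vs-rest strategy can be sketched in plain Python: for each class, the labels are reduced to "this class" versus "all others", one binary classifier is trained per class, and the class with the highest raw score wins. The helper names and the scores below are hypothetical:

```python
def binarize_labels(labels, positive_class):
    """Relabel a multi-class target for one binary 'class k vs. rest' problem."""
    return [1.0 if label == positive_class else 0.0 for label in labels]

def one_vs_rest_predict(scores_per_class):
    """Pick the class whose binary classifier reports the highest raw score."""
    return max(range(len(scores_per_class)), key=lambda k: scores_per_class[k])

labels = [0.0, 2.0, 1.0, 2.0]
binary_for_class_2 = binarize_labels(labels, 2.0)
predicted = one_vs_rest_predict([-0.84, -1.26, -1.43])
print(binary_for_class_2, predicted)
```

With 22 categories this means 22 binary SVMs are trained for every parameter combination.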

In [30]:
# train the multiclass model.
OVRModel = OVRclassifier.fit(train)

23/05/29 12:01:27 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/05/29 12:01:27 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
23/05/29 12:01:27 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
23/05/29 12:01:27 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
                                                                                

## Parameters for Test Classifier

In [91]:
OVRModel.getClassifier().extractParamMap()

{Param(parent='LinearSVC_26a8d959c6df', name='maxBlockSizeInMB', doc='maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0.'): 0.0,
 Param(parent='LinearSVC_26a8d959c6df', name='threshold', doc='The threshold in binary classification applied to the linear model prediction.  This threshold can be any real number, where Inf will make all predictions 0.0 and -Inf will make all predictions 1.0.'): 0.0,
 Param(parent='LinearSVC_26a8d959c6df', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
 Param(parent='LinearSVC_26a8d959c6df', name='standardization', doc='whether to standardize the training features before fitting the model.'): True,
 Param(parent='LinearSVC_26a8d959c6df', name='fitIntercept', doc='whether to fit an intercept term.'): True,
 Para

In [31]:
# score the model on the validation data.
predictions = OVRModel.transform(validation)
predictions.take(1)

                                                                                

[Row(label=0.0, features=SparseVector(2000, {}), rawPrediction=DenseVector([-0.8415, -1.2584, -1.4338, -1.2125, -1.6101, -2.2794, -1.3386, -1.2592, -1.1368, -1.3162, -1.2537, -1.5464, -1.494, -1.3791, -1.5059, -1.6248, -1.5896, -1.742, -1.6181, -1.8916, -1.2212, -2.0833]), prediction=0.0)]

In [32]:
# use the MulticlassClassificationEvaluator to estimate the performance of the trained classifier.
# this first check uses accuracy; the grid search uses the F1 measure as criterion.
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

In [33]:
# compute the accuracy on the validation data; the classification error is 1 - accuracy.
accuracy = evaluator.evaluate(predictions)

                                                                                

In [34]:
print("Validation Error = %g" % (1.0 - accuracy))

Validation Error = 0.315052


# SVM Gridsearch
- Compare chi square overall top 2000 filtered features with another, heavier filtering with much less dimensionality (see Spark ML documentation for options).
- Compare different SVM settings by varying the regularization parameter (choose 3 different values), standardization of training features (2 values), and maximum number of iterations (2 values).

In [35]:
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# train_validation (training and validation data) and test (test data) come from the split above

# train a Support Vector Machine classifier.
lsvc = LinearSVC()

# strategy that allows binary classifiers to be applicable: instantiate the One Vs Rest Classifier.
OVRclassifier = OneVsRest(classifier=lsvc)
mcevaluator = MulticlassClassificationEvaluator(metricName="f1")

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lsvc.regParam, [0.2, 0.1, 0.01]) \
    .addGrid(lsvc.standardization, [False, True])\
    .addGrid(lsvc.maxIter, [5, 10])\
    .build()

# In this case the estimator is the One-vs-Rest classifier wrapping the linear SVM.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=OVRclassifier,
                           estimatorParamMaps=paramGrid,
                           evaluator=mcevaluator,
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
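The grid is the Cartesian product of the three value lists, so 3 x 2 x 2 = 12 parameter combinations are trained and evaluated. A small sketch with `itertools.product`, using plain dictionaries instead of Spark parameter maps:

```python
from itertools import product

reg_params = [0.2, 0.1, 0.01]
standardization = [False, True]
max_iters = [5, 10]

# One dictionary per combination, in the same nesting order as the grid above.
grid = [
    {"regParam": reg, "standardization": std, "maxIter": iters}
    for reg, std, iters in product(reg_params, standardization, max_iters)
]
print(len(grid))
print(grid[0])
```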


## Generated Parameter Grid (evaluated with F1)

In [36]:
paramGrid

[{Param(parent='LinearSVC_3813097dd5cf', name='regParam', doc='regularization parameter (>= 0).'): 0.2,
  Param(parent='LinearSVC_3813097dd5cf', name='standardization', doc='whether to standardize the training features before fitting the model.'): False,
  Param(parent='LinearSVC_3813097dd5cf', name='maxIter', doc='max number of iterations (>= 0).'): 5},
 {Param(parent='LinearSVC_3813097dd5cf', name='regParam', doc='regularization parameter (>= 0).'): 0.2,
  Param(parent='LinearSVC_3813097dd5cf', name='standardization', doc='whether to standardize the training features before fitting the model.'): False,
  Param(parent='LinearSVC_3813097dd5cf', name='maxIter', doc='max number of iterations (>= 0).'): 10},
 {Param(parent='LinearSVC_3813097dd5cf', name='regParam', doc='regularization parameter (>= 0).'): 0.2,
  Param(parent='LinearSVC_3813097dd5cf', name='standardization', doc='whether to standardize the training features before fitting the model.'): True,
  Param(parent='LinearSVC_38130

In [37]:
# Run TrainValidationSplit, and choose the best set of parameters.
model2000 = tvs.fit(train_validation)


                                                                                

In [53]:
model2000.save("data/gridsearch-model-2000")

                                                                                

## Get chosen Parameters from Gridsearch

In [68]:
model2000.bestModel.getClassifier().getMaxIter()

10

In [67]:
model2000.bestModel.getClassifier().getStandardization()

True

In [70]:
model2000.bestModel.getClassifier().getRegParam()

0.01

In [75]:
model2000.bestModel.getClassifier().extractParamMap()

{Param(parent='LinearSVC_3813097dd5cf', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
 Param(parent='LinearSVC_3813097dd5cf', name='featuresCol', doc='features column name.'): 'features',
 Param(parent='LinearSVC_3813097dd5cf', name='fitIntercept', doc='whether to fit an intercept term.'): True,
 Param(parent='LinearSVC_3813097dd5cf', name='labelCol', doc='label column name.'): 'label',
 Param(parent='LinearSVC_3813097dd5cf', name='maxBlockSizeInMB', doc='maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0.'): 0.0,
 Param(parent='LinearSVC_3813097dd5cf', name='maxIter', doc='max number of iterations (>= 0).'): 10,
 Param(parent='LinearSVC_3813097dd5cf', name='predictionCol', doc='prediction column name.'): 'prediction',
 Param(parent='Linea

## Make Prediction

In [39]:
# Make predictions on test data. model2000 holds the model with the combination of parameters
# that performed best.
mypred2000 = model2000.transform(test)\
    .select("features", "label", "prediction")

## These are the Results for 2000 Features

In [72]:
MulticlassClassificationEvaluator(metricName="accuracy").evaluate(mypred2000)

                                                                                

0.6892236976506639

In [73]:
MulticlassClassificationEvaluator(metricName="f1").evaluate(mypred2000)

                                                                                

0.6597480289890458
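The F1 value is lower than the accuracy because the `f1` metric averages the per-class F1 scores weighted by class frequency, so classes with poor recall or precision pull it down. The following sketch computes both measures on a tiny hypothetical prediction set. It is a simplified re-implementation for illustration, not the Spark code:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Weight each class by its share of the true labels.
        total += f1 * count / len(y_true)
    return total

# Hypothetical labels: the minority class 1 is recognized only half of the time.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
f1_score = weighted_f1(y_true, y_pred)
print(acc, f1_score)
```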

In [87]:
# again 0.31 error
mypred2000.filter(mypred2000.prediction != mypred2000.label).count() / mypred2000.count()

                                                                                

0.31077630234933606

# Less dimensionality - 50 Features

In [41]:
selector50 = ChiSqSelector().setNumTopFeatures(50).setFeaturesCol("weightedFeatures").setLabelCol("categoryIndex").setOutputCol("selectedFeatures")

pipeline50 = Pipeline().setStages([regexTokenizer, remover, countVec, idf, encoder, selector50])

# fit the pipeline to training documents.
model50 = pipeline50.fit(reviewTextDF)

finalDF50 = model50.transform(reviewTextDF)
finalDF50.show()

                                                                                

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|            category|          reviewText|               terms|       filteredTerms|            features|    weightedFeatures|categoryIndex|    selectedFeatures|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|[this, was, a, gi...|[gift, husband, m...|(20000,[5,7,8,9,2...|(20000,[5,7,8,9,2...|         18.0|(50,[5,7,8,9,27,3...|
|Patio_Lawn_and_Garde|This is a very ni...|[this, is, a, ver...|[nice, spreader, ...|(20000,[2,4,8,9,1...|(20000,[2,4,8,9,1...|         18.0|(50,[2,4,8,9,16,1...|
|Patio_Lawn_and_Garde|The metal base wi...|[the, metal, base...|[metal, base, hos...|(20000,[6,20,21,3...|(20000,[6,20,21,3...|         18.0|(50,[6,19,20,37],...|
|Patio_Lawn_and_Garde|

In [42]:
mod50 = finalDF50.select("categoryIndex","selectedFeatures").toDF("label","features")
# apply vector length normalization before feeding the feature vectors into the classifier (use Normalizer with L2 norm).
normalizer = Normalizer().setInputCol("features").setOutputCol("normFeatures").setP(2.0)
l2NormData50 = normalizer.transform(mod50)

print("Normalized using L^2 norm")
l2NormData50.show()

Normalized using L^2 norm
+-----+--------------------+--------------------+
|label|            features|        normFeatures|
+-----+--------------------+--------------------+
| 18.0|(50,[5,7,8,9,27,3...|(50,[5,7,8,9,27,3...|
| 18.0|(50,[2,4,8,9,16,1...|(50,[2,4,8,9,16,1...|
| 18.0|(50,[6,19,20,37],...|(50,[6,19,20,37],...|
| 18.0|(50,[1,3,4,8,14,1...|(50,[1,3,4,8,14,1...|
| 18.0|(50,[1,36,41,43],...|(50,[1,36,41,43],...|
| 18.0|(50,[2,6,8,13,14,...|(50,[2,6,8,13,14,...|
| 18.0|(50,[6,14,18,49],...|(50,[6,14,18,49],...|
| 18.0|(50,[24,26,32],[2...|(50,[24,26,32],[0...|
| 18.0|(50,[8,11,20,23,2...|(50,[8,11,20,23,2...|
| 18.0|(50,[16,24,30,45]...|(50,[16,24,30,45]...|
| 18.0|(50,[1,4,5,17,18,...|(50,[1,4,5,17,18,...|
| 18.0|(50,[9,18,23,45,4...|(50,[9,18,23,45,4...|
| 18.0|(50,[4,11,14,18,2...|(50,[4,11,14,18,2...|
| 18.0|(50,[3],[1.551286...|      (50,[3],[1.0])|
| 18.0|(50,[1,2,8,9,13,1...|(50,[1,2,8,9,13,1...|
| 18.0|(50,[1,6,14,16,17...|(50,[1,6,14,16,17...|
| 18.0|(50,[1,4,6,17,22,

In [43]:
mod50 = l2NormData50.select("label","normFeatures").toDF("label","features")

# convert DataFrame columns and save as libsvm.
convertedVecDF50 = MLUtils.convertVectorColumnsToML(mod50)
convertedVecDF50.write.mode("overwrite").format("libsvm").save("data/Part3-50")

                                                                                

## SVM with 50 Features

In [44]:
# load data file. data/Part3-200 is also available
inputData50 = spark.read.format("libsvm").load("data/Part3-50")

23/05/29 12:11:29 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
                                                                                

In [45]:
# hold out 20% of the data as test set; TrainValidationSplit divides the rest into training and validation data.
# a fixed seed makes the experiments reproducible.
(train_validation50,test50) = inputData50.randomSplit([0.8, 0.2], seed = 42)

In [46]:
# Run TrainValidationSplit, and choose the best set of parameters.
model50 = tvs.fit(train_validation50)


23/05/29 12:12:27 ERROR OWLQN: Failure! Resetting history: breeze.optimize.NaNHistory: 
                                                                                

In [52]:
model50.save("data/gridsearch-model-50")

                                                                                

In [47]:
# Make predictions on test data. model50 holds the model with the combination of parameters
# that performed best.
mypred50 = model50.transform(test50)\
    .select("features", "label", "prediction")

## Results for 50 Features

In [90]:
# 200 features has failure rate of 0.48
# 50 features has failure rate of 0.64
mypred50.filter(mypred50.prediction != mypred50.label).count() / mypred50.count()

                                                                                

0.6450459652706844