# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [30]:
from pyspark.sql.functions import *
from pyspark.sql.types import *


schema = StructType(
    [
        StructField('name', StringType(), True),
        StructField('review', StringType(), True),
        StructField('rating', IntegerType(), True),
    ]
)

raw_data = (
    spark.read.option("header", True)
    .schema(schema)
    .csv("s3a://dimajix-training/data/amazon_baby")
)
raw_data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Clean and Cache Data

We need to clean up the data, since some reviews are NULL and one of the downstream transformations does not cope well with NULL values. We therefore drop all records with a missing review or rating.

To help distribute the workload, we also repartition the DataFrame and cache it.

In [31]:
data = (
    raw_data.filter(col('rating').isNotNull())
    .filter(col('review').isNotNull())
    .repartition(31)
    .cache()
)

data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,Cloud b Gentle Giraffe On The Go Travel Sound ...,This is the best item ever to calm your baby t...,5
1,"Evenflo Tribute 5 Convertible Car Seat, Ella",Cheap. Feels cheap too. Doesn't feel sturdy. H...,3
2,"Britax Parkway SGL Booster Seat, Cardinal","Great product, well made, comfortable, and bei...",5
3,Boon Scrubble Interchangeable Bath Toy Squirt ...,These are great. They are a tad bit hard to p...,5
4,Summer Infant By Your Side Sleeper Portable Be...,"""I just purchased this co sleeper, so I will l...",4


# Extract Sentiment

Since we want to perform a classification (positive review vs negative review), we need to extract a binary sentiment value. We will map the ratings as follows:

1. Ratings 1 and 2 count as a negative review
2. Rating 3 counts as a neutral review
3. Ratings 4 and 5 count as a positive review

Since we want a binary classification, we will also remove neutral reviews altogether.

In [32]:
data = data.filter(data.rating != 3)
data = data.withColumn('sentiment', when(data.rating < 3, 0.0).otherwise(1.0))

data.limit(10).toPandas()

Unnamed: 0,name,review,rating,sentiment
0,Cloud b Gentle Giraffe On The Go Travel Sound ...,This is the best item ever to calm your baby t...,5,1.0
1,"Britax Parkway SGL Booster Seat, Cardinal","Great product, well made, comfortable, and bei...",5,1.0
2,Boon Scrubble Interchangeable Bath Toy Squirt ...,These are great. They are a tad bit hard to p...,5,1.0
3,Summer Infant By Your Side Sleeper Portable Be...,"""I just purchased this co sleeper, so I will l...",4,1.0
4,"Planet Wise Hanging Wet/Dry Diaper Bag, Black","I am an avid checker of product reviews, so I ...",4,1.0
5,"Kushies Waterproof Bib with Sleeves, Blue Circ...",Love this bib! The design is great and it work...,5,1.0
6,"Evenflo Classic Johnny Jump Up, Frogs",We have two of door jumps - two different bran...,1,0.0
7,Baby Einstein Take Along Tunes,This has been a favorite of my daughters since...,5,1.0
8,North States Supergate Classic Plastic Gate Mo...,I like the gate but hate my house. Unfortunate...,4,1.0
9,Primo 4-In-1 Soft Seat Toilet Trainer and Step...,My daughter is 19 mos old (I got this about a ...,4,1.0
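
The mapping from ratings to labels can also be written down in plain Python, which makes the rule easy to check in isolation. This is only a sketch of the logic; the function name `rating_to_sentiment` is made up for illustration and is not part of the notebook.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a binary label; None marks a neutral review."""
    if rating == 3:
        return None  # neutral reviews are dropped
    return 0.0 if rating < 3 else 1.0


ratings = [5, 3, 1, 4, 2]
labels = [rating_to_sentiment(r) for r in ratings]

# Keep only the reviews that received a label.
kept = [label for label in labels if label is not None]
print(kept)
```

The DataFrame version above does the same in two steps: the filter removes the neutral rating, and `when(...).otherwise(...)` assigns 0.0 or 1.0.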


# Extract Features from Reviews

Now we want to split the review text into individual words, so we can create a "bag of words" model. In order to get a reasonably good model, we also need to remove all punctuation from the reviews. This is done as the first step, using a user-defined function (UDF) in PySpark.

In [40]:
import string

from pyspark.sql.types import *


def cleanup_text(text):
    if text:
        for c in string.punctuation:
            text = text.replace(c, ' ')
    return text


remove_punctuation = udf(cleanup_text, StringType())
data2 = data.withColumn('review', remove_punctuation('review'))
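
As a side note, the same cleanup can be done in a single pass over the string with a translation table instead of one `replace` call per punctuation character. The following is a minimal sketch in plain Python; the name `cleanup_text_fast` is hypothetical and only meant to contrast with `cleanup_text` above.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans({c: ' ' for c in string.punctuation})


def cleanup_text_fast(text):
    """Replace punctuation with spaces; pass None and empty strings through."""
    if text:
        return text.translate(PUNCT_TO_SPACE)
    return text


print(repr(cleanup_text_fast("Great product, well made!")))
```

Such a function could be wrapped in `udf(...)` exactly like `cleanup_text`. Note that each punctuation character becomes a space, so a comma followed by a space turns into two consecutive spaces.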

## Split Reviews into Words
We could do that ourselves using Python's split method, but we use a Transformer provided by PySpark instead, which saves some time and keeps the code clean.

In [41]:
from pyspark.ml.feature import *


tokenizer = Tokenizer(inputCol='review', outputCol='words')
words = tokenizer.transform(data2)

words.limit(3).toPandas()

Unnamed: 0,name,review,rating,sentiment,words
0,Cloud b Gentle Giraffe On The Go Travel Sound ...,This is the best item ever to calm your baby t...,5,1.0,"[this, is, the, best, item, ever, to, calm, yo..."
1,"Britax Parkway SGL Booster Seat, Cardinal",Great product well made comfortable and bei...,5,1.0,"[great, product, , well, made, , comfortable, ..."
2,Boon Scrubble Interchangeable Bath Toy Squirt ...,These are great They are a tad bit hard to p...,5,1.0,"[these, are, great, , , they, are, a, tad, bit..."
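
The empty strings in the `words` column are a side effect of the punctuation cleanup: the Tokenizer lowercases the text and appears to split on every single whitespace character, so two consecutive spaces produce an empty token. The plain Python sketch below approximates that behaviour; both function names are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split on every single whitespace character."""
    return re.split(r"\s", text.lower())


def tokenize_collapsed(text):
    """Variant that treats a run of whitespace as one separator."""
    return [token for token in re.split(r"\s+", text.lower()) if token]


sample = "Great product  well made  comfortable"
print(tokenize(sample))
print(tokenize_collapsed(sample))
```

If the empty tokens are not wanted, PySpark's `RegexTokenizer`, which splits on a whitespace pattern, would probably be an option, but this notebook keeps the plain `Tokenizer`.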


## Remove Stop words

We also want to remove so-called stop words, the small words which mainly serve as glue for building sentences. They usually do not carry much information in a simple bag of words model, so we get rid of them.

This is such a common practice that PySpark already contains a Transformer for doing just that.

In [42]:
stopWords = [
    'the',
    'a',
    'and',
    'or',
    'it',
    'this',
    'of',
    'an',
    'as',
    'in',
    'on',
    'is',
    'are',
    'to',
    'was',
    'for',
    'then',
    'i',
]

stopWordsRemover = StopWordsRemover(
    inputCol='words', outputCol='vwords', stopWords=stopWords
)
vwords = stopWordsRemover.transform(words)

vwords.limit(3).toPandas()

Unnamed: 0,name,review,rating,sentiment,words,vwords
0,Cloud b Gentle Giraffe On The Go Travel Sound ...,This is the best item ever to calm your baby t...,5,1.0,"[this, is, the, best, item, ever, to, calm, yo...","[best, item, ever, calm, your, baby, sleep, , ..."
1,"Britax Parkway SGL Booster Seat, Cardinal",Great product well made comfortable and bei...,5,1.0,"[great, product, , well, made, , comfortable, ...","[great, product, , well, made, , comfortable, ..."
2,Boon Scrubble Interchangeable Bath Toy Squirt ...,These are great They are a tad bit hard to p...,5,1.0,"[these, are, great, , , they, are, a, tad, bit...","[these, great, , , they, tad, bit, hard, push,..."
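
In plain Python, stop word removal boils down to a filter over the token list. The sketch below assumes a case-insensitive comparison, which is the default of `StopWordsRemover`; the function name and the shortened stop word list are for illustration only.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in the stop word list."""
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


stop_words = ['the', 'a', 'and', 'is', 'are', 'this', 'to']
tokens = ['these', 'are', 'great', '', '', 'they', 'are', 'a', 'tad', 'bit']
print(remove_stop_words(tokens, stop_words))
```

The empty tokens survive this step because the empty string is not in the stop word list.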


## Create Bag of Words Features

Finally we simply count the number of occurrences of each word within the reviews. Again we can use a Transformer from PySpark to perform that task.

In [43]:
countVectorizer = CountVectorizer(inputCol='vwords', outputCol='features', minDF=2.0)
countVectorizerModel = countVectorizer.fit(vwords)
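
To make the idea concrete, here is a small plain Python sketch of what fitting and applying a count vectorizer involves. It assumes that `minDF=2.0` means "keep a term only if it occurs in at least two documents" and that the vocabulary is ordered by total frequency; the alphabetical tie-break and both function names are assumptions made for this example.

```python
from collections import Counter


def fit_vocabulary(documents, min_df=2):
    """Keep terms found in at least min_df documents, most frequent first."""
    doc_freq = Counter()
    term_freq = Counter()
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    return sorted(kept, key=lambda term: (-term_freq[term], term))


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


docs = [
    ['great', 'seat', 'great'],
    ['bad', 'seat'],
    ['great', 'gate'],
]
vocab = fit_vocabulary(docs)
print(vocab)
print(count_vector(docs[0], vocab))
```

Words that occur in only one document ('bad' and 'gate' here) never make it into the vocabulary and therefore do not get a feature of their own.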

## Inspect Vocabulary

The countVectorizerModel contains an implicit vocabulary of all words it has kept. This is useful for mapping features back to words.

In [44]:
print(countVectorizerModel.vocabulary[0:50])

['', 'my', 'that', 'with', 'we', 'have', 'but', 't', 'so', 's', 'not', 'you', 'baby', 'they', 'one', 'very', 'when', 'great', 'can', 'be', 'he', 'she', 'would', 'at', 'just', 'up', 'use', 'these', 'out', 'all', 'our', 'them', 'like', 'love', 'had', 'her', 'if', 'easy', 'has', 'little', 'seat', 'old', 'well', 'get', 'only', 'from', 'will', 'product', 'because', 'more']


# Tidy up DataFrame

The DataFrame now carries many columns, so let's remove some intermediate ones to focus on the model.

In [45]:
features = countVectorizerModel.transform(vwords).drop('words')

features.limit(3).toPandas()

Unnamed: 0,name,review,rating,sentiment,vwords,features
0,Cloud b Gentle Giraffe On The Go Travel Sound ...,This is the best item ever to calm your baby t...,5,1.0,"[best, item, ever, calm, your, baby, sleep, , ...","(2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,"Britax Parkway SGL Booster Seat, Cardinal",Great product well made comfortable and bei...,5,1.0,"[great, product, , well, made, , comfortable, ...","(5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,Boon Scrubble Interchangeable Bath Toy Squirt ...,These are great They are a tad bit hard to p...,5,1.0,"[these, great, , , they, tad, bit, hard, push,...","(6.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ..."


# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a test data set. Let's use 80% of all reviews for training and 20% for testing.

In [46]:
train_data, test_data = features.randomSplit([0.8, 0.2], seed=0)

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())

train_data: 126874
test_data: 31705
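
Note that the two counts are close to, but not exactly, an 80/20 split. A random split assigns each row to a part at random, so the sizes only match the weights approximately. The following plain Python sketch illustrates the idea with an independent draw per row; it is not Spark's actual implementation, and the function name is made up.

```python
import random


def random_split(rows, weights, seed=0):
    """Assign each row independently to a split with the given probabilities."""
    total = sum(weights)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating point round-off
    return splits


train, test = random_split(range(10000), [0.8, 0.2], seed=0)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=0`.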


# Train Classifier

There are many different classification algorithms out there. We will use LogisticRegression, although a DecisionTreeClassifier could be another interesting option.

In [47]:
from pyspark.ml.classification import *


logisticRegression = LogisticRegression(featuresCol='features', labelCol='sentiment')
logisticModel = logisticRegression.fit(train_data)

## Inspect Model
The LogisticRegressionModel holds one coefficient per feature, that is, one per word in the vocabulary. Let's have a look at them.

In [48]:
print(logisticModel.coefficients.toArray()[0:20])

[ 0.01168209  0.22786136  0.20238767  0.24145364  0.65199258  0.18077175
 -0.34238165 -2.41905135  0.60320699  0.39776083 -3.29387396 -0.06376773
  0.07940401  0.21586512  0.23324266 -0.17306954  0.07453342  3.96279634
  1.03241461 -0.13461789]


In [50]:
numPositiveWeights = len(
    list(filter(lambda x: x > 0, logisticModel.coefficients.toArray()))
)
numNegativeWeights = len(
    list(filter(lambda x: x < 0, logisticModel.coefficients.toArray()))
)

print("Number positive weights %d" % numPositiveWeights)
print("Number negative weights %d" % numNegativeWeights)

Number positive weights 18092
Number negative weights 11196


## Find Weights of some Words

Let's check what the coefficients look like for some clearly positive or negative words.

In [51]:
def print_word_weight(word):
    index = countVectorizerModel.vocabulary.index(word)
    weight = logisticModel.coefficients[index]
    print('%s : %f' % (word, weight))


print_word_weight('good')
print_word_weight('great')
print_word_weight('bad')
print_word_weight('ugly')

good : 1.937799
great : 3.962796
bad : -0.802468
ugly : 2.498138


## Find Extreme Words

Let us try to find the most positive and the most negative word according to the weights. This can be achieved using NumPy's argmin and argmax functions to find the index, and the vocabulary to map the index back to the actual word.

In [52]:
import numpy as np


worstWordIndex = np.argmin(logisticModel.coefficients.toArray())
worstWord = countVectorizerModel.vocabulary[worstWordIndex]
worstWeight = logisticModel.coefficients[worstWordIndex]
print("Worst word: %s  value %f" % (worstWord, worstWeight))

bestWordIndex = np.argmax(logisticModel.coefficients.toArray())
bestWord = countVectorizerModel.vocabulary[bestWordIndex]
bestWeight = logisticModel.coefficients[bestWordIndex]
print("Best word: %s value %f" % (bestWord, bestWeight))

Worst word: mooing  value -141.764890
Best word: breezes value 161.181740
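
A single argmin or argmax only returns one word each, and as seen above that word can be a rather unusual one. Looking at the top few words on both ends often gives a better picture. The sketch below shows one way to do this in plain Python; the function name and the weights are made up for illustration and are not taken from the trained model.

```python
def extreme_words(vocabulary, weights, k=3):
    """Return the k most negative and k most positive (word, weight) pairs."""
    ranked = sorted(zip(vocabulary, weights), key=lambda pair: pair[1])
    return ranked[:k], ranked[-k:][::-1]


# Made-up weights for illustration only, not taken from the trained model.
vocab = ['great', 'bad', 'love', 'broke', 'seat']
weights = [3.9, -0.8, 2.5, -2.1, 0.1]

most_negative, most_positive = extreme_words(vocab, weights, k=2)
print(most_negative)
print(most_positive)
```

With the real model, one would pass `countVectorizerModel.vocabulary` and `logisticModel.coefficients.toArray()` as the two arguments.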


# Making Predictions

The primary idea is of course to make predictions of the sentiment using the learned model.

In [53]:
pred = logisticModel.transform(test_data)

pred.drop('features').limit(10).toPandas()

Unnamed: 0,name,review,rating,sentiment,vwords,rawPrediction,probability,prediction
0,,As a mom of twins these holders are my favorit...,5,1.0,"[mom, twins, these, holders, my, favorite, , t...","[-17.932458840780992, 17.932458840780992]","[1.6294163559712737e-08, 0.9999999837058364]",1.0
1,,He is real tough and has been hit several time...,5,1.0,"[he, real, tough, has, been, hit, several, tim...","[-17.415478840064583, 17.415478840064583]","[2.7324588365491493e-08, 0.9999999726754117]",1.0
2,,I did a lot of reviews on all the play yards a...,5,1.0,"[did, lot, reviews, all, play, yards, availabl...","[-70.40874225030952, 70.40874225030952]","[2.641628630667009e-31, 1.0]",1.0
3,'The Insulated Sippy' by Eco Vessel Insulated ...,This is was my son s favorite sippy We used...,5,1.0,"[, , my, son, s, favorite, sippy, , we, used, ...","[1.082589017250085, -1.082589017250085]","[0.7469836180690544, 0.2530163819309455]",0.0
4,(1) Cresci Products Window Wedge (2 Per Pack) ...,The means of attaching to a window was pretty ...,4,1.0,"[means, attaching, window, pretty, straight, f...","[-3.739933618946556, 3.739933618946556]","[0.023204442690751417, 0.9767955573092485]",1.0
5,*SPECIAL PROMOTION*The Art of CureTM *SAFETY K...,My toddler 2 yrs he was drooling all over his...,5,1.0,"[my, toddler, 2, yrs, , he, drooling, all, ove...","[-7.951579391667169, 7.951579391667169]","[0.00035198167784236385, 0.9996480183221577]",1.0
6,*The Art of CureTM *SAFETY KNOTTED* - Mixed Co...,I can t say anything but WONDERFUL things abou...,5,1.0,"[can, t, say, anything, but, wonderful, things...","[-62.246020447388446, 62.246020447388446]","[9.266096083825356e-28, 1.0]",1.0
7,10 Piece Baby Rattle and Teether Toy Gift Set ...,bought it for my 3 month old obviously she ll...,4,1.0,"[bought, my, 3, month, old, , obviously, she, ...","[-11.52526303479438, 11.52526303479438]","[9.877284697866307e-06, 0.9999901227153021]",1.0
8,1st Bday Girl Autograph Bear,I loved the idea of an autograph teddy bear th...,4,1.0,"[loved, idea, autograph, teddy, bear, that, my...","[-10.70198297372375, 10.70198297372375]","[2.2499769938835655e-05, 0.9999775002300613]",1.0
9,2 Inch Foam Bassinet Mattress - 16&quot; x 32&...,i so wanted to love this because it took me fo...,2,0.0,"[so, wanted, love, because, took, me, forever,...","[10.377909578448454, -10.377909578448454]","[0.9999688887382504, 3.1111261749518125e-05]",0.0
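
The three prediction columns are closely related. For a binary logistic regression, rawPrediction holds the margin with both signs, probability is the logistic function applied to that margin, and prediction compares the positive probability against a threshold. The sketch below reproduces this in plain Python, assuming the default threshold of 0.5; the function names are made up for this example.

```python
import math


def sigmoid(margin):
    """Logistic function, written to stay stable for large |margin|."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def predict(margin, threshold=0.5):
    """Turn the positive-class margin into (probabilities, predicted label)."""
    p_positive = sigmoid(margin)
    label = 1.0 if p_positive > threshold else 0.0
    return [1.0 - p_positive, p_positive], label


# Positive-class margins copied from rows 3 and 4 of the output above.
print(predict(-1.082589017250085))
print(predict(3.739933618946556))
```

Row 3 is an example of a misclassification: the review has a rating of 5, but the margin for the positive class is negative, so the model predicts 0.0.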


## Find the most Positive Review

By sorting on the second component of the rawPrediction column, we can find the reviews with the highest positive prediction.

In [54]:
# Extract one component from a Vector column
extract_from_vector = udf(lambda v, i: float(v[i]), FloatType())

positives = pred.orderBy(extract_from_vector(pred.rawPrediction, lit(1)).desc())

positives.limit(6).toPandas()

Unnamed: 0,name,review,rating,sentiment,vwords,features,rawPrediction,probability,prediction
0,BabyPlus Prenatal Education System,I started wearing the Babyplus when I was 18 ...,5,1.0,"[, started, wearing, babyplus, when, 18, weeks...","(123.0, 5.0, 0.0, 1.0, 2.0, 0.0, 3.0, 3.0, 2.0...","[-433.31092507126573, 433.31092507126573]","[6.538171273697786e-189, 1.0]",1.0
1,"P'Kolino Silly Soft Seating in Tias, Green",I ve purchased both the P Kolino Little Reader...,4,1.0,"[ve, purchased, both, p, kolino, little, reade...","(61.0, 9.0, 4.0, 4.0, 3.0, 4.0, 8.0, 2.0, 0.0,...","[-423.4328898048961, 423.4328898048961]","[1.2747719578282095e-184, 1.0]",1.0
2,"UPPAbaby Cruz Stroller, Denny",I have the 2012 model and have been using it ...,4,1.0,"[have, 2012, model, , have, been, using, 8, mo...","(177.0, 14.0, 15.0, 12.0, 2.0, 6.0, 12.0, 6.0,...","[-381.801368396545, 381.801368396545]","[1.5338133811104968e-166, 1.0]",1.0
3,Chariot 2011 Cabriolet Bicycle Trailer **CLOSE...,We were looking for a bike trailer JUST a bi...,4,1.0,"[we, were, looking, bike, trailer, , , just, b...","(185.0, 5.0, 16.0, 11.0, 5.0, 6.0, 16.0, 10.0,...","[-348.18818234818536, 348.18818234818536]","[6.078462507294663e-152, 1.0]",1.0
4,Roan Rocco Classic Pram Stroller 2-in-1 with B...,Great Pram Rocco I bought this pram from ...,5,1.0,"[great, pram, rocco, , , , , , bought, pram, f...","(67.0, 3.0, 5.0, 6.0, 4.0, 3.0, 2.0, 5.0, 4.0,...","[-342.70461393419976, 342.70461393419976]","[1.4631108994202463e-149, 1.0]",1.0
5,"Evenflo 6 Pack Classic Glass Bottle, 4-Ounce",It s always fun to write a review on those pro...,5,1.0,"[s, always, fun, write, review, those, product...","(219.0, 9.0, 16.0, 13.0, 13.0, 4.0, 12.0, 8.0,...","[-337.6993363255923, 337.6993363255923]","[2.1829394596232316e-147, 1.0]",1.0


# Evaluation of Prediction

Next we want to assess the performance of the prediction model. This can be done using the built-in class BinaryClassificationEvaluator, which reports the area under the ROC curve by default.

In [56]:
from pyspark.ml.evaluation import *


evaluator = BinaryClassificationEvaluator(labelCol='sentiment')
result = evaluator.evaluate(pred)

print(result)

0.8755224793908036
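
The area under the ROC curve can be read as the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one. The plain Python sketch below computes it directly from that definition. It is quadratic in the number of examples and only meant as an illustration; Spark derives the value from the ROC curve instead, so results may differ slightly. The function name and the toy data are made up.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1.0]
    negatives = [s for label, s in zip(labels, scores) if label == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration only.
labels = [1.0, 1.0, 0.0, 1.0, 0.0]
scores = [0.9, 0.4, 0.6, 0.8, 0.1]
print(area_under_roc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the result above indicates that the model ranks most positive reviews above negative ones.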


# Custom Evaluator

We also want to use a different metric, namely accuracy. Accuracy is defined as

    number_correct_predictions / total_number_predictions
    
First let us calculate that metric directly.

In [57]:
num_total = pred.count()
num_correct = pred.filter(pred.sentiment == pred.prediction).count()

model_accuracy = float(num_correct) / num_total

print("Model Accuracy: %f" % model_accuracy)

Model Accuracy: 0.884151


## Compare with Dummy Predictor

It is always interesting to see how a trivial predictor performs. The trivial predictor simply predicts the most common class for every record. In this case that is a positive review.

In [58]:
num_total = pred.count()
num_positive = pred.filter(pred.sentiment == 1.0).count()

baseline_accuracy = float(num_positive) / num_total

print("Baseline Accuracy: %f" % baseline_accuracy)

Baseline Accuracy: 0.846428
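
So the model reaches an accuracy of about 0.884, compared to about 0.846 for the trivial predictor, a gain of roughly 3.8 percentage points. Both numbers follow the same pattern, which the plain Python sketch below spells out on a toy example. The function names and the data are made up for illustration.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Share of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels and predictions for illustration only.
labels = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
predictions = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]

print("model:    %.3f" % accuracy(labels, predictions))
print("baseline: %.3f" % majority_baseline(labels))
```

With strongly imbalanced classes, as in this data set, the baseline is already high, so accuracy should always be read relative to it.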
