# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [None]:
from pyspark.sql.functions import *
from pyspark.sql.types import *


schema = StructType([
    StructField('name', StringType(), True),
    StructField('review', StringType(), True),
    StructField('rating', IntegerType(), True),
])

# Store the unfiltered input as raw_data; the cleaning step below reads from it
raw_data = spark.read.option("header", True).schema(schema).csv("s3a://dimajix-training/data/amazon_baby")
raw_data.limit(5).toPandas()

## Clean and Cache Data

We need to clean up the data, since some reviews are NULL and one of the downstream transformations appears to have problems with NULL values.

To help distribute the workload, we also repartition the DataFrame and cache it.

In [None]:
data = raw_data.filter(col('rating').isNotNull()) \
    .filter(col('review').isNotNull()) \
    .repartition(31) \
    .cache()

data.limit(5).toPandas()

# Extract Sentiment

Since we want to perform a classification (positive review vs negative review), we need to extract a binary sentiment value. We will map the ratings as follows:

1. Ratings 1 and 2 count as a negative review
2. Rating 3 counts as a neutral review
3. Ratings 4 and 5 count as a positive review

Since we want a binary classification, we will also remove neutral reviews altogether.
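
The mapping rule can be illustrated on a handful of ratings in plain Python (this is not the PySpark solution). The helper name `rating_to_sentiment` and the encoding of positive as 1 and negative as 0 are assumptions made for this sketch; a numeric label is chosen because the classifier later uses `sentiment` as its label column.

```python
# Plain-Python illustration of the rating -> sentiment rule (not PySpark).
# Assumption: positive is encoded as 1, negative as 0, neutral is dropped.
def rating_to_sentiment(rating):
    """Return 1 for ratings 4 and 5, 0 for ratings 1 and 2, None for rating 3."""
    if rating is None or rating == 3:
        return None
    return 1 if rating > 3 else 0


ratings = [5, 1, 3, 4, 2]

# Keep only the ratings that map to a sentiment, i.e. drop neutral reviews
labelled = [
    (r, rating_to_sentiment(r))
    for r in ratings
    if rating_to_sentiment(r) is not None
]

print(labelled)  # [(5, 1), (1, 0), (4, 1), (2, 0)]
```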

In [None]:
# Remove all reviews with a rating of 3 from data
data = ...

# Add new column sentiment according to rules above
data = ...

# Extract Features from Reviews

Now we want to split the review text into individual words, so we can create a "bag of words" model. To get a reasonably good model, we also need to remove all punctuation from the reviews. This is done as the first step using a user-defined function (UDF) in PySpark.

In [None]:
import string

from pyspark.sql.types import *


def cleanup_text(text):
    if text:
        for c in string.punctuation:
            text = text.replace(c, ' ')
    return text

# Register cleanup_text as a UDF
remove_punctuation = ...

# Apply the UDF to data and store the result in data2. The cleaned column should be called 'review' again
data2 = ...

## Split Reviews into Words
We could do that ourselves using Python's split method, but we use a Transformer provided by PySpark instead. This saves some time and helps keep the code clean.
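
For comparison, a hand-rolled version in plain Python might look like the sketch below. The function name `tokenize` is hypothetical, and the behaviour is only an approximation of what a tokenizer does (lowercase, then split on whitespace); the PySpark Transformer may treat repeated spaces differently.

```python
# Plain-Python approximation of tokenization (not the PySpark Tokenizer).
def tokenize(text):
    # Lowercase the text and split on runs of whitespace
    return text.lower().split()


tokens = tokenize("Great  product  my baby loves it")
print(tokens)  # ['great', 'product', 'my', 'baby', 'loves', 'it']
```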

In [None]:
from pyspark.ml.feature import *


# Create appropriate instance of PySpark Tokenizer, such that 
# reviews will be split up and stored in a new column 'words'
tokenizer = ...

# Create new DataFrame words by applying the Tokenizer to data2
words = ...

# Fetch first 3 rows and display them as a Pandas DataFrame
...

## Remove Stop words

We also want to remove so-called stop words, the small words that mainly serve as glue for building sentences. They usually carry little information in a simple bag of words model, so we get rid of them.

This is such common practice that PySpark already contains a Transformer for exactly that.
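
Conceptually, stop word removal is just a filter over each token list. The following plain-Python sketch shows the idea on a single toy sentence; the helper `remove_stop_words` and the sample data are made up for illustration and are not part of PySpark.

```python
# Plain-Python illustration of stop word removal (not the PySpark Transformer).
stop_words = {'the', 'a', 'and', 'it', 'is'}


def remove_stop_words(tokens, stop_words):
    # Keep every token that is not in the stop word set
    return [t for t in tokens if t not in stop_words]


tokens = ['the', 'stroller', 'is', 'sturdy', 'and', 'light']
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['stroller', 'sturdy', 'light']
```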

In [None]:
stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

# Create an instance of StopWordsRemover. Store the result in new column 'vwords'.
stopWordsRemover = ...

# Create new DataFrame vwords by applying the StopWordsRemover to words
vwords = ...

# Fetch first 3 rows and display them as a Pandas DataFrame
...

## Create Bag of Words Features

Finally we count the number of occurrences of each word within the reviews. Again we can use a Transformer from PySpark to perform that task.
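
To make the idea concrete, here is a minimal bag of words in plain Python on three toy documents. It also applies a minimum document frequency of 2, so a word must appear in at least two documents to enter the vocabulary. All names and data are hypothetical, and the vocabulary is sorted alphabetically only to keep the sketch deterministic; the PySpark model chooses its own ordering.

```python
from collections import Counter

# Toy corpus: each document is already tokenized
docs = [
    ['soft', 'blanket', 'soft'],
    ['blanket', 'broke'],
    ['soft', 'toy'],
]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

# Keep only words that appear in at least min_df documents
min_df = 2
vocabulary = sorted(w for w, df in doc_freq.items() if df >= min_df)


def to_counts(doc):
    # Count occurrences of each vocabulary word in one document
    counts = Counter(doc)
    return [counts[w] for w in vocabulary]


vectors = [to_counts(d) for d in docs]

print(vocabulary)  # ['blanket', 'soft']
print(vectors)     # [[1, 2], [1, 0], [0, 1]]
```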

In [None]:
# Create instance of CountVectorizer, store results in column 'features'
# Set additional parameter minDF to 2.0, such that each word needs to appear in at least two documents.
countVectorizer = ...

# Create a model from the CountVectorizer by fitting vwords
countVectorizerModel = ...

## Inspect Vocabulary

The countVectorizerModel contains an implicit vocabulary of all words it kept. This can be useful for mapping features back to words.

In [None]:
print(countVectorizerModel.vocabulary[0:50])

# Tidy up DataFrame

The DataFrame now carries many columns, so let's remove some intermediate ones to focus on our model.

In [None]:
# Extract features by using the model to transform vwords
features = ...

# Display first three rows of result as Pandas DataFrame
...

# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a validation data set. We use 80% of all reviews for training and 20% for validation.
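
A weighted random split can be pictured as assigning each row independently to one side with the given probability. The plain-Python sketch below follows that assumption (it is not Spark's implementation), which also explains why the resulting sizes are only approximately 80% and 20%.

```python
import random

# Fixed seed so the split is reproducible, mirroring the seed argument used below
rng = random.Random(0)

rows = list(range(1000))
train, test = [], []

for row in rows:
    # Each row goes to the training set with probability 0.8
    if rng.random() < 0.8:
        train.append(row)
    else:
        test.append(row)

print("train: %d" % len(train))
print("test: %d" % len(test))
```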

In [None]:
train_data, test_data = features.randomSplit([0.8,0.2], seed=0)

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())

# Train Classifier

There are many different classification algorithms out there. We will use LogisticRegression; a DecisionTreeClassifier could be another interesting option.

In [None]:
from pyspark.ml.classification import *


logisticRegression = LogisticRegression(featuresCol='features',labelCol='sentiment')
logisticModel = logisticRegression.fit(train_data)

## Inspect Model
The LogisticRegressionModel holds coefficients that map to individual words of the vocabulary. Let's have a look at them.

In [None]:
print(logisticModel.coefficients.toArray()[0:20])

In [None]:
# Count with a generator: in Python 3, filter() returns an iterator, so len(filter(...)) fails
numPositiveWeights = sum(1 for x in logisticModel.coefficients.toArray() if x > 0)
numNegativeWeights = sum(1 for x in logisticModel.coefficients.toArray() if x < 0)

print("Number positive weights %d" % numPositiveWeights)
print("Number negative weights %d" % numNegativeWeights)

## Find Weights of some Words

Let's check what the coefficients look like for some clearly positive or negative words.

In [None]:
# Define a function which prints the weight of a given word
def print_word_weight(word):
    # First you need to find the word in the vocabulary of the countVectorizerModel.
    # You need the index within the vocabulary. Note: Python lists have a nice method called 'index'
    index = ...
    # Now lookup the weight in the model's coefficients using the index
    weight = ...
    print('%s : %f' % (word, weight))
    
print_word_weight('good')
print_word_weight('great')    
print_word_weight('bad')
print_word_weight('ugly')

## Find Extreme Words

Let us try to find the most positive and the most negative word according to the weights. This can be achieved using NumPy's argmin and argmax functions to find the indices, and the vocabulary to map each index back to the actual word.
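
The index lookup itself is easy to picture on toy data. The sketch below uses only the standard library instead of NumPy, and both the vocabulary and the weights are invented values, not results from the trained model.

```python
# Toy data: invented vocabulary and weights, aligned by position
vocabulary = ['great', 'bad', 'stroller', 'love']
weights = [1.2, -0.9, 0.05, 1.5]

# Index of the smallest and largest weight (what argmin / argmax would return)
worst_index = min(range(len(weights)), key=weights.__getitem__)
best_index = max(range(len(weights)), key=weights.__getitem__)

# Map each index back to its word
print("Worst word: %s value %f" % (vocabulary[worst_index], weights[worst_index]))
print("Best word: %s value %f" % (vocabulary[best_index], weights[best_index]))
```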

In [None]:
import numpy as np


# Find the index of the coefficient with the lowest value. Note: np.argmin could be your friend
worstWordIndex = ...
# Find the word belonging to the index in the model's vocabulary
worstWord = ...
# Find the weight belonging to the index
worstWeight = ...
print("Worst word: %s  value %f" % (worstWord, worstWeight))

# Repeat exercise with most positive word
bestWordIndex = ...
bestWord = ...
bestWeight = ...
print("Best word: %s value %f" % (bestWord, bestWeight))

# Making Predictions

The primary idea is of course to make predictions of the sentiment using the learned model.

In [None]:
pred = logisticModel.transform(test_data)

pred.drop('features').limit(10).toPandas()

## Find the most Positive Review

Using the column rawPrediction, we can find the review which has the highest positive prediction.

In [None]:
# Extract one component from a Vector column
extract_from_vector = udf(lambda v,i : float(v[i]), FloatType())

positives = pred.orderBy(extract_from_vector(pred.rawPrediction,lit(1)).desc())

positives.limit(6).toPandas()

# Evaluation of Prediction

Again we want to assess the performance of the prediction model. This can be done using the built-in class BinaryClassificationEvaluator.
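
To my knowledge the evaluator's default metric is the area under the ROC curve (AUC). One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on invented labels and scores; Spark computes the curve differently, so this illustrates the metric rather than reproducing the evaluator.

```python
def area_under_roc(labels, scores):
    # Separate the scores of positive (1) and negative (0) examples
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count positive/negative pairs ranked correctly; ties count as half
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy data
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

auc = area_under_roc(labels, scores)
print("AUC: %f" % auc)  # 5 of 6 pairs are ranked correctly
```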

In [None]:
from pyspark.ml.evaluation import *


evaluator = BinaryClassificationEvaluator(labelCol='sentiment')
result = evaluator.evaluate(pred)

print(result)

# Custom Evaluator

We want to use a different metric, namely accuracy. Accuracy is defined as

    number_correct_predictions / total_number_predictions
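
On toy data the definition works out as follows. This is plain Python with invented labels and predictions, not the DataFrame-based calculation.

```python
# Invented toy data: true labels and predicted labels, aligned by position
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]

# A prediction is correct when it equals the true label
num_total = len(labels)
num_correct = sum(1 for l, p in zip(labels, predictions) if l == p)

accuracy = float(num_correct) / num_total
print("Accuracy: %f" % accuracy)  # 3 of 5 predictions are correct
```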
    
First let us calculate that metric directly:

In [None]:
# Get the total number of predictions
num_total = ...
# Get the number of correct predictions according to learned model
num_correct = ...

model_accuracy = float(num_correct) / num_total

print("Model Accuracy: %f" % model_accuracy)

## Compare with Dummy Predictor

It is always interesting to see how a trivial prediction performs. The trivial predictor simply predicts the most common class for all objects. In this case this would be a positive review.
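
The baseline is simple to state in plain Python: find the most common label and measure how often it is right. The labels below are invented for illustration.

```python
from collections import Counter

# Invented toy labels: 1 = positive review, 0 = negative review
labels = [1, 1, 0, 1, 0, 1]

# The trivial predictor always answers with the most common class
majority_class, majority_count = Counter(labels).most_common(1)[0]

# It is correct exactly for the rows that belong to that class
baseline_accuracy = float(majority_count) / len(labels)

print("Majority class: %d" % majority_class)
print("Baseline Accuracy: %f" % baseline_accuracy)
```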

In [None]:
# Get total number of predictions
num_total = ...
# Get the number of correct predictions according to baseline classifier, which always predicts "positive"
num_correct = ...

baseline_accuracy = float(num_correct) / num_total

print("Baseline Accuracy: %f" % baseline_accuracy)