# Modeling - Tree Methods + K-Means
---

Building on the tree-based modeling in the other notebook, `20-review-prediction.ipynb`, we will try to predict a review's rating (1, 2, 3, 4, or 5) from its text and other features.

One of our methods clusters the reviews with K-Means, based on the review text, and uses the cluster assignment as a feature in our random forest and decision tree classifiers.

## Package Installation + Setup
---

Because multiple team members used Google Drive, please check the Drive path in each notebook. The paths might differ slightly!

In [5]:
# Install PySpark packages and set environment variables to use Spark on Colab
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

!pip install pysparkling

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

In [None]:
# Google Drive Authentication
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
# PROJECT PATH
cur_path = "/content/drive/MyDrive/Colab Notebooks/BigDataScaling/Project/"
os.chdir(cur_path)
!pwd

/content/drive/MyDrive/Colab Notebooks/BigDataScaling/Project


In [None]:
import pandas as pd
import numpy as np

# Import VectorAssembler and Vectors

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler, StringIndexer, Word2Vec
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

# PySpark SQL
from pyspark.sql import SparkSession, Window  # Window is used for the row-number join below
from pyspark.sql.functions import substring
from pyspark.sql.types import StringType,BooleanType,DateType,IntegerType
from pyspark.sql.functions import isnan, when, count, col
from pyspark.sql.functions import *
import pyspark.sql.functions as f

# NLP / PySparkling
from pysparkling import *  
from nltk.corpus import stopwords  

In [None]:
# create a spark session
spark = SparkSession.builder.appName('tree').getOrCreate()

## Load + Clean Data
---

Load the Disneyland Reviews CSV and clean the year/month and branch columns.

In [None]:
# Load data
data = spark.read.csv(cur_path + 'DisneylandReviews.csv',inferSchema=True,header=True)

In [None]:
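# Create Year and Month columns from Year_Month (e.g. "2019-4")
data = data.withColumn('Year', substring('Year_Month', 1, 4))
# The month starts at position 6 and is at most 2 characters long
data = data.withColumn('Month', substring('Year_Month', 6, 2))
# Clean Branch name: drop the 11-character "Disneyland_" prefix
data = data.withColumn('Branch_Clean', substring('Branch', 12, 50))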
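
Spark's `substring(str, pos, len)` counts positions from 1 rather than 0, which is easy to trip over. The plain-Python sketch below mimics that behavior with a hypothetical helper, `spark_substring`, to show which characters the cell above keeps. The value `"2019-12"` is a made-up example of a two-digit month; it does not run Spark.

```python
def spark_substring(s: str, pos: int, length: int) -> str:
    """Hypothetical stand-in for Spark's substring(str, pos, len), which is 1-based."""
    start = pos - 1
    return s[start:start + length]

# "2019-4" appears in the data preview; "2019-12" is an illustrative two-digit month
for year_month in ["2019-4", "2019-12"]:
    year = spark_substring(year_month, 1, 4)
    month = spark_substring(year_month, 6, 2)
    print(year_month, "->", year, month)

# Dropping the 11-character "Disneyland_" prefix leaves the branch name
branch = spark_substring("Disneyland_HongKong", 12, 50)
print(branch)
```

Asking for more characters than remain simply returns the rest of the string, which is why a length of 50 works for every branch name.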

In [None]:
data.printSchema()

root
 |-- Review_ID: integer (nullable = true)
 |-- Rating: integer (nullable = true)
 |-- Year_Month: string (nullable = true)
 |-- Reviewer_Location: string (nullable = true)
 |-- Review_Text: string (nullable = true)
 |-- Branch: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- Branch_Clean: string (nullable = true)



In [None]:
# data preview
data.show()

+---------+------+----------+--------------------+--------------------+-------------------+----+-----+------------+
|Review_ID|Rating|Year_Month|   Reviewer_Location|         Review_Text|             Branch|Year|Month|Branch_Clean|
+---------+------+----------+--------------------+--------------------+-------------------+----+-----+------------+
|670772142|     4|    2019-4|           Australia|If you've ever be...|Disneyland_HongKong|2019|    4|    HongKong|
|670682799|     4|    2019-5|         Philippines|Its been a while ...|Disneyland_HongKong|2019|    5|    HongKong|
|670623270|     4|    2019-4|United Arab Emirates|Thanks God it was...|Disneyland_HongKong|2019|    4|    HongKong|
|670607911|     4|    2019-4|           Australia|HK Disneyland is ...|Disneyland_HongKong|2019|    4|    HongKong|
|670607296|     4|    2019-4|      United Kingdom|the location is n...|Disneyland_HongKong|2019|    4|    HongKong|
|670591897|     3|    2019-4|           Singapore|Have been to Disn...|D

In [None]:
# Change Year/Month to integer
data = data.withColumn('Year', col("Year").cast(IntegerType()))
data = data.withColumn('Month', col('Month').cast(IntegerType()))

## Drop Missing Data
---

Only ~2.6k records (6%) were missing a review date, so we are going to drop them.

In [None]:
data.select([count(when(col('Year').isNull(),True))]).show()

+---------------------------------------------+
|count(CASE WHEN (Year IS NULL) THEN true END)|
+---------------------------------------------+
|                                         2613|
+---------------------------------------------+



In [None]:
# Drop rows that contain a null in any column, including the rows missing a review date
data = data.na.drop('any')

In [None]:
# Confirming dropped records
data.select([count(when(col('Year').isNull(),True))]).show()

+---------------------------------------------+
|count(CASE WHEN (Year IS NULL) THEN true END)|
+---------------------------------------------+
|                                            0|
+---------------------------------------------+



## Encoding String Features
---

We use a `StringIndexer` ([link](https://spark.apache.org/docs/latest/ml-features#stringindexer)) to encode the branch name and the reviewer's country.

In [None]:
# deal with string features
indexer_location = StringIndexer(inputCol="Reviewer_Location", outputCol="LocationIndex")
indexer_branch = StringIndexer(inputCol='Branch_Clean', outputCol="BranchIndex")

In [None]:
data_fixed = indexer_location.fit(data).transform(data)
data_fixed = indexer_branch.fit(data_fixed).transform(data_fixed)
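
By default, `StringIndexer` orders labels by descending frequency, so the most common value receives index `0.0`. The plain-Python sketch below imitates that ordering with a hypothetical helper, `frequency_desc_index`, on a made-up list of branch names. It illustrates the idea only and is not the Spark implementation.

```python
from collections import Counter

def frequency_desc_index(values):
    """Hypothetical helper: map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample; the real column has one value per review
branches = ["California", "Paris", "California", "HongKong", "California", "Paris"]
branch_index = frequency_desc_index(branches)
print(branch_index)
```

The resulting numbers are category codes, not quantities, so their order reflects how often a value occurs rather than any ranking of the branches.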

## Tokenize Review Text
---

To predict a review's rating from its text, we tokenize the reviews and remove stop words. We also set up a hashed term-frequency stage and a TF-IDF stage with a minimum document frequency of 5 to filter out rarely used words. Only the term-frequency hash is computed in the cells below.

The cleaned, tokenized review text (`word_tokens`) is then used to fit a `Word2Vec` model, and the resulting review vectors are fed into a k-means clustering model.

In [None]:

tokenizer = \
    RegexTokenizer(inputCol='Review_Text', outputCol = 'tokenized_words', pattern="\\W+", minTokenLength = 3)

text_data = tokenizer.transform(data_fixed)
remover = StopWordsRemover(inputCol='tokenized_words', outputCol = 'word_tokens')
text_data = remover.transform(text_data)

hashtf = HashingTF(numFeatures=2**16, inputCol="word_tokens", outputCol='tf')
idf = IDF(inputCol='tf', outputCol="tfidf", minDocFreq=5)  # minDocFreq: ignore terms that appear in fewer than 5 documents
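
To make the tokenization step concrete, here is a plain-Python sketch of roughly what the `RegexTokenizer` and `StopWordsRemover` settings above do: lowercase the text, split on runs of non-word characters, drop tokens shorter than 3 characters, then remove stop words. The helper names, the sample review, and the tiny stop list are assumptions for illustration; Spark's default English stop list is much longer.

```python
import re

# Tiny illustrative stop list (an assumption, not Spark's default list)
STOP_WORDS = {"the", "and", "was", "for", "you"}

def tokenize(text: str, min_token_length: int = 3) -> list[str]:
    """Lowercase, split on runs of non-word characters, drop short tokens."""
    pieces = re.split(r"\W+", text.lower(), flags=re.ASCII)
    return [p for p in pieces if len(p) >= min_token_length]

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

review = "It's a GREAT park, and the rides were fun!!"
tokens = tokenize(review)
word_tokens = remove_stop_words(tokens)
print(tokens)
print(word_tokens)
```

Note how the minimum token length already discards fragments such as `it`, `s`, and `a` before the stop-word filter runs.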

In [None]:
text_data = hashtf.transform(text_data.select('word_tokens'))

In [None]:
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="word_tokens", outputCol="sentence")

text = text_data.select('word_tokens')
model = word2Vec.fit(text)
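
When a fitted `Word2Vec` model transforms a review, it represents the whole review as the average of its word vectors, which is how a variable-length token list becomes a fixed-size vector (5 dimensions here). The sketch below shows that averaging in plain Python with made-up 2-dimensional vectors. The helper `average_vectors` is hypothetical and assumes every token is in the vocabulary; handling of out-of-vocabulary tokens is left out.

```python
def average_vectors(tokens, word_vectors):
    """Element-wise mean of the vectors for the given tokens (all assumed in vocabulary)."""
    vectors = [word_vectors[t] for t in tokens]
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Made-up word vectors; a trained model would learn these from the reviews
toy_vectors = {"great": [1.0, 0.0], "park": [0.0, 1.0], "fun": [0.5, 0.5]}
sentence_vector = average_vectors(["great", "park", "fun"], toy_vectors)
print(sentence_vector)
```

Averaging discards word order, so two reviews that use the same words in different arrangements end up with the same vector.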

## K-Means Clustering with Tokenized Review Text
---

In [None]:
text_data = model.transform(text_data)

In [None]:
assembler_kmeans = VectorAssembler(
    inputCols=['sentence'],
    outputCol="features")

kmeans_df = assembler_kmeans.transform(text_data)

In [None]:
# Initialize and fit k-means clustering model (2 clusters)
kmeans = KMeans(featuresCol='features').setK(2).setSeed(1)
km_model2 = kmeans.fit(kmeans_df)

In [None]:
# Show predictions from our dataset
predictions2 = km_model2.transform(kmeans_df)
predictions2.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|17553|
|         0|22490|
+----------+-----+
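
The `prediction` column holds the index of the cluster center closest to each review vector. As a minimal sketch of that assignment step, assuming Euclidean distance (Spark's default for `KMeans`), the plain-Python code below uses made-up 2-dimensional centers and vectors. The helper names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_cluster(vector, centers):
    """Index of the closest center by squared Euclidean distance."""
    return min(range(len(centers)), key=lambda i: squared_distance(vector, centers[i]))

# Hypothetical centers and review vectors, for illustration only
centers = [[0.0, 0.0], [1.0, 1.0]]
vectors = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.4]]
assignments = [assign_cluster(v, centers) for v in vectors]
print(assignments)
```

The cluster labels `0` and `1` are arbitrary identifiers; nothing in the algorithm ties them to positive or negative reviews.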



In [None]:
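# Add a row number to both DataFrames so they can be joined back together.
# NOTE: this assumes both DataFrames keep the same row order, which Spark does not guarantee.
w = Window.orderBy(lit(1))
data_fixed = data_fixed.withColumn("rn", row_number().over(w)-1)
predictions2 = predictions2.withColumn("rn", row_number().over(w)-1)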

In [None]:
data_fixed = data_fixed.join(predictions2,["rn"]).drop("rn")
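
Joining on a generated row number only works if both DataFrames return their rows in the same order. A safer pattern is to join on a real key such as `Review_ID`, assuming that column is carried through the text pipeline (the cells above select only `word_tokens`, so this would need a small change). The plain-Python sketch below uses made-up rows to contrast the two approaches.

```python
# Made-up rows for illustration
reviews = [
    {"Review_ID": 1, "Rating": 5},
    {"Review_ID": 2, "Rating": 3},
    {"Review_ID": 3, "Rating": 4},
]
# Same reviews, but returned in a different order (as a distributed engine may do)
clusters = [
    {"Review_ID": 3, "prediction": 1},
    {"Review_ID": 1, "prediction": 0},
    {"Review_ID": 2, "prediction": 1},
]

# Positional alignment: pairs rows purely by their position
by_position = [{**r, "prediction": c["prediction"]} for r, c in zip(reviews, clusters)]

# Key-based join: looks up each review's cluster by its ID
cluster_by_id = {c["Review_ID"]: c["prediction"] for c in clusters}
by_key = [{**r, "prediction": cluster_by_id[r["Review_ID"]]} for r in reviews]

print(by_position[0])  # review 1 paired with review 3's cluster
print(by_key[0])       # review 1 paired with its own cluster
```

In the positional version nothing fails loudly; the rows are simply mismatched, which is what makes this kind of bug hard to spot.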

In [None]:
data_fixed = data_fixed.drop('features')
data_fixed.show()

+---------+------+----------+--------------------+--------------------+-------------------+----+-----+------------+-------------+-----------+--------------------+--------------------+--------------------+----------+
|Review_ID|Rating|Year_Month|   Reviewer_Location|         Review_Text|             Branch|Year|Month|Branch_Clean|LocationIndex|BranchIndex|         word_tokens|                  tf|            sentence|prediction|
+---------+------+----------+--------------------+--------------------+-------------------+----+-----+------------+-------------+-----------+--------------------+--------------------+--------------------+----------+
|670772142|     4|    2019-4|           Australia|If you've ever be...|Disneyland_HongKong|2019|    4|    HongKong|          2.0|        2.0|[ever, disneyland...|(65536,[329,4756,...|[-0.0936634524935...|         1|
|670682799|     4|    2019-5|         Philippines|Its been a while ...|Disneyland_HongKong|2019|    5|    HongKong|          5.0|       

## Tree-Based Classifiers (Decision Tree, Random Forest)
---

Predicting the rating (1, 2, 3, 4, 5) is a multiclass classification problem. We're going to see how well we can predict the rating using both a decision tree and a random forest classifier.

We're also going to train/test the models on two versions of our dataset with slightly different features! This will result in 4 total models (2x RF, 2x DT).

In [None]:
# Creating dataset 1
assembler = VectorAssembler(
    inputCols=['Year', 'Month', 'BranchIndex'],
    outputCol="features")

output = assembler.transform(data_fixed)
final_data = output.select('features','Rating')

# 70/30 train/test split
train_data, test_data = final_data.randomSplit([0.7,0.3])

In [None]:
# Creating dataset 2 using our K-Means clustering
assembler3 = VectorAssembler(
    inputCols=['Year', 'Month', 'prediction', 'BranchIndex'],
    outputCol="features")

output3 = assembler3.transform(data_fixed)
final_data3 = output3.select('features','Rating')

# 70/30 train/test split
train_data3, test_data3 = final_data3.randomSplit([0.7,0.3])
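
The two `randomSplit` calls above are unseeded, so the two dataset versions are likely trained and tested on different rows, which makes their accuracies harder to compare. `randomSplit` accepts a `seed` argument; another option is to decide the train/test membership once and reuse it for both feature sets. The plain-Python sketch below shows the second idea with a hypothetical helper, `split_ids`, and placeholder IDs.

```python
import random

def split_ids(ids, train_fraction=0.7, seed=42):
    """Hypothetical helper: reproducible train/test split of a collection of IDs."""
    rng = random.Random(seed)
    shuffled = sorted(ids)  # fixed starting order so the seed fully determines the result
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])

review_ids = list(range(100))  # placeholder IDs, not real Review_ID values
train_ids, test_ids = split_ids(review_ids)

# Calling it again with the same seed gives the same membership,
# so both feature sets can be filtered with identical train/test rows
train_ids_again, test_ids_again = split_ids(review_ids)
print(len(train_ids), len(test_ids))
```

With a shared split, any difference in accuracy between the two feature sets comes from the features rather than from which rows happened to land in the test set.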

### Initialize models

In [None]:
# Use mostly defaults to make this comparison "fair"
dtc = DecisionTreeClassifier(labelCol='Rating',featuresCol='features')
rfc = RandomForestClassifier(labelCol='Rating',featuresCol='features')

In [None]:
# Set hyperparameters (Decision Tree)
dtc.setMaxDepth(30)  # maximum depth Spark allows (default is 5)
dtc.setMaxBins(32)   # same as the default

# Set hyperparameters (Random Forest)
rfc.setMaxDepth(30)
rfc.setMaxBins(32)
rfc.setNumTrees(500)  # default is 20

RandomForestClassifier_16a8902ec8b4

### Train Models

In [None]:
# Train Decision Tree model -- 2x (once per dataset version)
dtc_model = dtc.fit(train_data)
dtc_model3 = dtc.fit(train_data3)

# Record predictions on test sets
dtc_predictions = dtc_model.transform(test_data)
dtc_predictions3 = dtc_model3.transform(test_data3)

In [None]:
# Train Random Forest model -- 2x (once per dataset version)
rfc_model = rfc.fit(train_data)
rfc_model3 = rfc.fit(train_data3)

# Record predictions on test sets
rfc_predictions = rfc_model.transform(test_data)
rfc_predictions3 = rfc_model3.transform(test_data3)

### Model Results

In [None]:
# Evaluator that compares predictions with the true labels and computes test accuracy
acc_evaluator = MulticlassClassificationEvaluator(labelCol="Rating", predictionCol="prediction", metricName="accuracy")

In [None]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)

In [None]:
print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
#print('-'*80)
#print('A ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 54.27%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 54.62%


In [None]:
dtc_acc3 = acc_evaluator.evaluate(dtc_predictions3)
rfc_acc3 = acc_evaluator.evaluate(rfc_predictions3)

print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc3*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc3*100))

Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 53.34%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 54.09%
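
Accuracy figures like these are easier to interpret next to a simple baseline, such as always predicting the most common rating. The plain-Python sketch below computes that baseline with a hypothetical helper on a made-up list of ratings. The numbers are illustrative only and are not taken from the Disneyland data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Made-up ratings for illustration
toy_ratings = [5, 5, 4, 5, 3, 4, 5, 1, 5, 4]
label, baseline = majority_baseline_accuracy(toy_ratings)
print(f"Always predicting {label} gives {baseline:.2%} accuracy")
```

If the models score close to this baseline on the real test set, the features are adding little beyond the class distribution.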


In [None]:
print(rfc_model.featureImportances)
print(rfc_model3.featureImportances)

(3,[0,1,2],[0.15384163958097158,0.16651018795900555,0.6796481724600228])
(4,[0,1,2,3],[0.08663907617636457,0.08792418341963637,0.4543978833540674,0.37103885704993167])
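
The importance vectors are printed in sparse form, with positions that follow the `inputCols` order of each `VectorAssembler`. The sketch below pairs the printed values with those column names and sorts them, which makes the two forests easier to compare. The helper `rank_features` is a hypothetical convenience, not part of Spark.

```python
# Column order taken from the two VectorAssembler definitions above
features_v1 = ['Year', 'Month', 'BranchIndex']
features_v2 = ['Year', 'Month', 'prediction', 'BranchIndex']

# Values copied from the printed featureImportances output
importances_v1 = [0.15384163958097158, 0.16651018795900555, 0.6796481724600228]
importances_v2 = [0.08663907617636457, 0.08792418341963637,
                  0.4543978833540674, 0.37103885704993167]

def rank_features(names, importances):
    """Pair each feature name with its importance, highest first."""
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

ranked_v1 = rank_features(features_v1, importances_v1)
ranked_v2 = rank_features(features_v2, importances_v2)

for title, ranked in [("Dataset 1", ranked_v1), ("Dataset 2", ranked_v2)]:
    print(title)
    for name, score in ranked:
        print(f"  {name:<12} {score:.3f}")
```

Read this way, the K-Means cluster (`prediction`) is the highest-ranked feature in the second forest, even though test accuracy did not improve. Importance reflects how heavily a feature is used in the trees' splits, which is not the same as a gain on unseen data.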
