<a href="https://colab.research.google.com/github/Sharon-Tseng/ISOM3770_big_data/blob/main/assignment3_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3: Customer Review Sentiment Prediction

Customer reviews are a rich source of “voice of the customer” data and can offer valuable insight into what customers like and dislike about a product or service. For e-commerce businesses, reviews are critical: because new customers cannot see or handle the product before purchase, existing reviews heavily influence their buying decisions.

In this assignment, you will use a sample Amazon product review dataset to build a sentiment classifier with Spark ML.

For each cell that starts with **TO COMPLETE**, complete the code.

For submission, you will download this notebook and submit it on Canvas.

# Step 1: Set up PySpark on Google Colab (may take a few minutes)

Select the cell below and run the code by clicking the **Play icon** in the left gutter of the cell (or press **Cmd/Ctrl+Enter** to run the cell in place).

In [None]:
# download Java 11
!apt-get install openjdk-11-jdk-headless -qq > /dev/null

# download Spark 3.4.0 + Hadoop 3
!wget -q https://apache.root.lu/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
!tar xf spark-3.4.0-bin-hadoop3.tgz

# install findspark
!pip install -q findspark

# set up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.0-bin-hadoop3"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.feature import Tokenizer, CountVectorizer


# Step 2: Download dataset

After running the cell, you should be able to see the file in the left bar under "Files".

Note: if you leave Google Colab idle for a long time, Google will remove the file, so you will need to run this cell again to restore it.

In [None]:
# quote the URL so the shell does not treat "&" as a background operator
!wget -O amazon_review.csv "https://www.dropbox.com/scl/fi/45zrsbmcredg19hpy51ty/amazon_review.csv?rlkey=qed1zc9r5iewb6r3lt0gv0xpl&st=cpw41fkj&dl=0"

# Step 3:  Load data into Spark DataFrame

In [None]:
df = spark.read.csv('amazon_review.csv', header = True, inferSchema = True)
print('the dataset contains number of rows: ' + str(df.count()))

the dataset contains number of rows: 9022


In [None]:
# TO COMPLETE
# show the content of the first 10 rows of the table without truncating any columns
df.show(10, truncate=False)

+-----+--------------------+
|label|              review|
+-----+--------------------+
|    1|                  ok|
|    1|Perfect, even stu...|
|    0|If the words, &#3...|
|    1|Exactly what I wa...|
|    1|I will look past ...|
|    0|The controls are ...|
|    0|The printer came ...|
|    1|Great camera for ...|
|    1|Product is very g...|
|    0|Lasted a few hour...|
+-----+--------------------+
only showing top 10 rows



## Step 4: Data Preparation
For text analysis, you need to convert the input text into numeric vectors that a machine learning model can process.

In [None]:
# TO COMPLETE. Tokenize the review sentence into words
# Use tokenizer
tokenized = Tokenizer(inputCol = 'review', outputCol='words')
tokenized_df = tokenized.transform(df)
tokenized_df.show(10, truncate = False)
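
To make the tokenization step concrete, here is a minimal plain-Python sketch of roughly what `Tokenizer` does to a single review: lowercase the text, then split it on whitespace. The helper name `tokenize` is hypothetical, and Spark's handling of repeated whitespace may differ slightly from `str.split()`, so treat this as an approximation rather than the library's implementation.

```python
def tokenize(review: str) -> list[str]:
    """Lowercase the text and split it on whitespace."""
    return review.lower().split()


words = tokenize("Great camera for the price")
print(words)  # ['great', 'camera', 'for', 'the', 'price']

# Punctuation stays attached to the neighbouring word.
print(tokenize("Perfect, even"))  # ['perfect,', 'even']
```

Because punctuation is not stripped, `perfect,` and `perfect` become different vocabulary entries. If that matters for your features, `RegexTokenizer` from `pyspark.ml.feature` is one option worth exploring.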


In [None]:
# TO COMPLETE. Use CountVectorizer to transform the word lists into feature vectors
# You can also use HashingTF to build feature vectors
cv = CountVectorizer(inputCol = 'words', outputCol = 'features', minDF=1)
cv_df = cv.fit(tokenized_df).transform(tokenized_df)
cv_df.show(10, truncate = False)
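
The idea behind count vectorization can be sketched in plain Python: learn a vocabulary from all documents, then describe each document by how often each vocabulary term occurs in it. The helper names `fit_vocabulary` and `to_sparse` are hypothetical, and the alphabetical tie-break is an assumption made here for determinism, not a documented Spark guarantee.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each term to an index, most frequent term first."""
    counts = Counter(term for doc in docs for term in doc)
    # Assumption of this sketch: break frequency ties alphabetically.
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: index for index, term in enumerate(ordered)}


def to_sparse(doc, vocabulary):
    """Return (vocabulary size, sorted indices, counts) for one document."""
    counts = Counter(vocabulary[term] for term in doc if term in vocabulary)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]


docs = [["great", "camera"], ["great", "great", "price"]]
vocabulary = fit_vocabulary(docs)
print(vocabulary)                      # {'great': 0, 'camera': 1, 'price': 2}
print(to_sparse(docs[1], vocabulary))  # (3, [0, 2], [2.0, 1.0])
```

The triple mirrors the sparse-vector notation Spark prints in the `features` column: in `(28257,[0,1,2,3,4...`, the first number is the vocabulary size and the list that follows holds the indices of the terms present in that review.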


## Step 5: Split dataset into training and testing
Classification is a supervised learning problem, so we split the dataset into a training set and a test set with an approximate 70/30 ratio. We train the model on the training set and evaluate its performance on the test set.

In [None]:
train, test = cv_df.select('label', 'features').randomSplit([0.7,0.3])
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 6374
Test Dataset Count: 2648
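
The counts above are about 70.6% and 29.4% of the 9022 rows rather than an exact 70/30 cut, because the split is random. The plain-Python sketch below illustrates the general idea with an independent random draw per row; the helper `random_split` is hypothetical and is not Spark's implementation. It also shows why fixing a seed makes the split repeatable (`randomSplit` accepts an optional `seed` argument for the same purpose).

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(9022))  # stand-in for the 9022 reviews
train_rows, test_rows = random_split(rows, 0.7, seed=42)

# The sizes are close to, but usually not exactly, a 70/30 split.
print(len(train_rows), len(test_rows))

# Re-running with the same seed reproduces the same split.
print(random_split(rows, 0.7, seed=42) == (train_rows, test_rows))  # True
```

Without a fixed seed, every run trains and tests on different rows, so the AUC values reported later will vary slightly between runs.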


## Step 6: Train and evaluate a Logistic Regression Classification model

In [None]:
#TO COMPLETE
# Train a logistic regression model with training data and evaluate model performance

lr = LogisticRegression(maxIter = 10, regParam = 0.3)
lrModel = lr.fit(train)
pred = lrModel.transform(test)
pred.show(10)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|(28257,[0,1,2,3,4...|[15.0989884334558...|[0.99999972292807...|       0.0|
|    0|(28257,[0,1,2,3,4...|[8.27521624209572...|[0.99974531191627...|       0.0|
|    0|(28257,[0,1,2,3,4...|[8.27521624209572...|[0.99974531191627...|       0.0|
|    0|(28257,[0,1,2,3,4...|[2.36653146726978...|[0.91423929792174...|       0.0|
|    0|(28257,[0,1,2,3,4...|[1.72100882668406...|[0.84825873395800...|       0.0|
|    0|(28257,[0,1,2,3,4...|[1.34462316733475...|[0.79324919158409...|       0.0|
|    0|(28257,[0,1,2,3,4...|[3.56963703217614...|[0.97260551978572...|       0.0|
|    0|(28257,[0,1,2,3,4...|[-0.3066726668734...|[0.42392711021461...|       1.0|
|    0|(28257,[0,1,2,3,4...|[0.54896890822311...|[0.63389633676857...|       0.0|
|    0|(28257,[0
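
The `rawPrediction` and `probability` columns in the table above are related: for binary logistic regression, the probability is the logistic (sigmoid) function of the raw margin. The short check below uses two first-entry `rawPrediction` values copied from the table; the 0.5 cut-off is assumed to be the default threshold, since the cell does not set one.

```python
import math


def sigmoid(margin: float) -> float:
    """Logistic function: maps a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


# First rawPrediction entries copied from two rows of the table above.
for margin in (2.36653146726978, -0.3066726668734):
    p_class0 = sigmoid(margin)
    # Assumed default threshold of 0.5: predict class 0 when it is more likely.
    predicted = 0.0 if p_class0 >= 0.5 else 1.0
    print(round(p_class0, 4), predicted)
# 0.9142 0.0
# 0.4239 1.0
```

The second row shows a misclassification: the true label is 0, but the probability of class 0 falls just below 0.5, so the model predicts 1.0.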

In [None]:
# AUC
evaluator = BinaryClassificationEvaluator(rawPredictionCol = 'rawPrediction', labelCol= 'label', metricName = 'areaUnderROC')
auc = evaluator.evaluate(pred)
print('area under ROC curve is:', auc)

area under ROC curve is: 0.8783803819749745
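
One way to read the AUC is as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes that pairwise quantity on hypothetical scores. It illustrates the interpretation only; `BinaryClassificationEvaluator` computes the area from the ROC curve itself, and the function name `area_under_roc` is made up for this example.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]  # hypothetical scores for the positive class
print(area_under_roc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so an AUC near 0.88 means the model orders most positive/negative pairs correctly.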


## Step 7: Train and evaluate a Random Forest Classification model

In [None]:
# TO COMPLETE
# Train a Random Forest Classification model with training data and evaluate model performance

rf = RandomForestClassifier(numTrees = 25, maxDepth = 25)
pred_rf = rf.fit(train).transform(test)
pred_rf.show(10)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|(28257,[0,1,2,3,4...|[11.1223028344505...|[0.44489211337802...|       1.0|
|    0|(28257,[0,1,2,3,4...|[22.7335448604477...|[0.90934179441791...|       0.0|
|    0|(28257,[0,1,2,3,4...|[22.7335448604477...|[0.90934179441791...|       0.0|
|    0|(28257,[0,1,2,3,4...|[11.132873467355,...|[0.44531493869419...|       1.0|
|    0|(28257,[0,1,2,3,4...|[8.09696781552737...|[0.32387871262109...|       1.0|
|    0|(28257,[0,1,2,3,4...|[7.69069514509105...|[0.30762780580364...|       1.0|
|    0|(28257,[0,1,2,3,4...|[10.7252801125531...|[0.42901120450212...|       1.0|
|    0|(28257,[0,1,2,3,4...|[8.8311616022283,...|[0.35324646408913...|       1.0|
|    0|(28257,[0,1,2,3,4...|[7.56265357816875...|[0.30250614312675...|       1.0|
|    0|(28257,[0
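
The numbers in the table above are consistent with the random forest's `rawPrediction` being a sum of per-tree votes and `probability` being that sum divided by the number of trees. The check below uses two first-entry `rawPrediction` values copied from the table; the helper name `normalize` is hypothetical, and the 0.5 cut-off is an assumption about the default threshold.

```python
NUM_TREES = 25  # numTrees used when the forest was trained


def normalize(raw_class0: float, num_trees: int) -> float:
    """Turn a summed per-tree vote for class 0 into a probability."""
    return raw_class0 / num_trees


# First rawPrediction entries copied from two rows of the table above.
for raw in (11.1223028344505, 22.7335448604477):
    p_class0 = normalize(raw, NUM_TREES)
    # Assumed default threshold of 0.5: predict class 0 when it is more likely.
    predicted = 0.0 if p_class0 >= 0.5 else 1.0
    print(round(p_class0, 4), predicted)
# 0.4449 1.0
# 0.9093 0.0
```

Many of the rows shown have a class-0 probability between 0.3 and 0.45 even though their label is 0, which fits the lower AUC of this forest compared with logistic regression.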

In [None]:
# AUC
evaluator = BinaryClassificationEvaluator(rawPredictionCol = 'rawPrediction', labelCol= 'label', metricName = 'areaUnderROC')
auc = evaluator.evaluate(pred_rf)
print('area under ROC curve is:', auc)

area under ROC curve is: 0.8336759832976467


## Question:
What is the best model performance you can get, and with which model settings (e.g., the number of trees in the random forest)?

*   Hint: your best model performance should be greater than 0.80. If it is not, there is something wrong with your data features.




**Answer:**
1. Logistic Regression Classification Model:
   - AUC ≈ 0.88
   - maxIter = 10, regParam = 0.3
2. Random Forest Classification Model:
   - AUC ≈ 0.83
   - numTrees = 25, maxDepth = 25

Of the two settings tried, logistic regression gave the better AUC.