
In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz

In [3]:
# Install Spark-related dependencies for Python
!pip install -q findspark
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=96143b7ca1c63aa3d1f83312b5f7458d841522dde58d578876871ae26c0867c3
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 py

In [0]:
# Set up required environment variables

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [5]:
# Point Colaboratory to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.sql.functions import isnan, when, count, col

In [0]:
JSON_PATH = "/content/gdrive/My Drive/Colab Datasets/amazon_reviews_video_games.json" 
APP_NAME = "Amazon Reviews Sentiment Analysis"
SPARK_URL = "local[*]"
RANDOM_SEED = 141107
TRAINING_DATA_RATIO = 0.8


First, let's build the Spark session we will be working from.

In [0]:
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()


Let's take a closer look at the dataset.

In [9]:
df = spark.read.json(JSON_PATH)
df.show(5)
print(f"There are {df.count()} reviews in this dataset")

+----------+-------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|0700099867|[8, 12]|    1.0|Installing the ga...| 07 9, 2012|A2HD75EMZR8QLN|                 123|Pay to unlock con...|    1341792000|
|0700099867| [0, 0]|    4.0|If you like rally...|06 30, 2013|A3UR8NLLY1ZHCX|Alejandro Henao "...|     Good rally game|    1372550400|
|0700099867| [0, 0]|    1.0|1st shipment rece...|06 28, 2014|A1INA0F5CWW3J4|Amazon Shopper "M...|           Wrong key|    1403913600|
|0700099867|[7, 10]|    3.0|I got this versio...|09 14, 2011|A1DLMTOTHQ4AST|            ampgreen|awesome game, if ...|    1315958400|
|0700099867| [2, 2]|    4.0|I had Dirt 2 on X...|06 14, 2011|A

Let's drop columns we don't need for this analysis.

In [10]:
# Keep only 'overall' and 'reviewText' by selecting the two columns of interest.
df = df.select('overall', 'reviewText')
df.show(5)

+-------+--------------------+
|overall|          reviewText|
+-------+--------------------+
|    1.0|Installing the ga...|
|    4.0|If you like rally...|
|    1.0|1st shipment rece...|
|    3.0|I got this versio...|
|    4.0|I had Dirt 2 on X...|
+-------+--------------------+
only showing top 5 rows



Since Spark's machine learning algorithms only accept numerical values, we need to make some adjustments. Let's draw a somewhat arbitrary line between favorable and unfavorable sentiment: scores above 3.0 become 'favorable' (1), and scores of 3.0 or lower become 'unfavorable' (0).

In [11]:
df = df.withColumn('overall', when(col('overall') > 3.0, 1).otherwise(0))
df.show(5)

+-------+--------------------+
|overall|          reviewText|
+-------+--------------------+
|      0|Installing the ga...|
|      1|If you like rally...|
|      0|1st shipment rece...|
|      0|I got this versio...|
|      1|I had Dirt 2 on X...|
+-------+--------------------+
only showing top 5 rows



*Alternatively, I could use StringIndexer to retain all five ratings (1-5). But for now, I will keep it simple.*
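
For reference, here is a rough plain-Python sketch of what indexing all five ratings could look like. It mimics what I understand to be the default ordering of `StringIndexer` (the most frequent label gets index 0) on a made-up list of ratings. `index_labels` is a hypothetical helper, not part of pyspark.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties by label so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up ratings, for illustration only
ratings = ["5.0", "4.0", "5.0", "1.0", "5.0", "4.0"]
mapping = index_labels(ratings)
indexed = [mapping[r] for r in ratings]

print(mapping)
print(indexed)
```

With more than two classes, the binary evaluator used later in this notebook would no longer apply; a multiclass evaluator would be needed instead.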

Now we need a small dictionary of favorable and unfavorable words to look for in each reviewText.

In [0]:
favor_words = ['love', 'great', 'good', 'recommend', 'fun', 'amazing']
unfavor_words = ['disappointing', 'suck', 'bad', 'waste', 'return', 'not recommend']

We will then check each reviewText for the words in both lists and create one boolean feature column per word.

In [13]:
#df.withColumn('love', when(col('reviewText').like('%love%'), 1).otherwise(0)).show()
df.show(5)

+-------+--------------------+
|overall|          reviewText|
+-------+--------------------+
|      0|Installing the ga...|
|      1|If you like rally...|
|      0|1st shipment rece...|
|      0|I got this versio...|
|      1|I had Dirt 2 on X...|
+-------+--------------------+
only showing top 5 rows



In [0]:
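def feature_selector(dataframe, word_list):
    """Add one boolean column per word, True when reviewText contains the word."""
    for word in word_list:
        # like('%word%') is a case-sensitive substring match
        contains_word = col('reviewText').like(f"%{word}%")
        dataframe = dataframe.withColumn(word, when(contains_word, True).otherwise(False))
    return dataframe

# start with the favorable words
df = feature_selector(df, favor_words)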
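
One thing worth noting about `like('%word%')`: as far as I understand it, it is a case-sensitive substring match. That means 'Great' is missed, 'bad' fires inside 'badge', and any review containing 'not recommend' also sets the 'recommend' flag. The plain-Python sketch below reproduces that behaviour with the `in` operator and contrasts it with a case-insensitive, word-boundary regex. `substring_flag` and `word_flag` are hypothetical helpers, and the review strings are made up.

```python
import re

def substring_flag(text, word):
    """Case-sensitive substring test, analogous to LIKE '%word%'."""
    return word in text

def word_flag(text, word):
    """Case-insensitive whole-word (or whole-phrase) test."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Made-up reviews, for illustration only
reviews = [
    "Great game, would recommend",
    "I earned every badge",
    "Would not recommend this",
]

for review in reviews:
    print(review)
    for word in ["great", "bad", "recommend"]:
        print(f"  {word}: substring={substring_flag(review, word)}, word={word_flag(review, word)}")
```

Even the word-boundary version still flags 'recommend' in "Would not recommend this", so negation would need separate handling.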


In [23]:
# repeat with unfavorable words
df = feature_selector(df, unfavor_words)
df.show(5)

+-------+--------------------+-----+-----+-----+---------+-----+-------+-------------+-----+-----+-----+------+-------------+
|overall|          reviewText| love|great| good|recommend|  fun|amazing|disappointing| suck|  bad|waste|return|not recommend|
+-------+--------------------+-----+-----+-----+---------+-----+-------+-------------+-----+-----+-----+------+-------------+
|      0|Installing the ga...|false|false|false|    false|false|  false|        false|false|false|false| false|        false|
|      1|If you like rally...|false|false|false|    false| true|  false|        false|false|false|false| false|        false|
|      0|1st shipment rece...|false|false| true|    false|false|  false|        false|false|false|false| false|        false|
|      0|I got this versio...|false| true| true|    false| true|  false|        false|false|false| true|  true|        false|
|      1|I had Dirt 2 on X...|false|false|false|    false| true|  false|        false|false|false|false| false|       

Next we need to vectorize the features.

In [24]:
# let's separate out the feature columns
features = df.drop('overall', 'reviewText')
features.show(5)


+-----+-----+-----+---------+-----+-------+-------------+-----+-----+-----+------+-------------+
| love|great| good|recommend|  fun|amazing|disappointing| suck|  bad|waste|return|not recommend|
+-----+-----+-----+---------+-----+-------+-------------+-----+-----+-----+------+-------------+
|false|false|false|    false|false|  false|        false|false|false|false| false|        false|
|false|false|false|    false| true|  false|        false|false|false|false| false|        false|
|false|false| true|    false|false|  false|        false|false|false|false| false|        false|
|false| true| true|    false| true|  false|        false|false|false| true|  true|        false|
|false|false|false|    false| true|  false|        false|false|false|false| false|        false|
+-----+-----+-----+---------+-----+-------+-------------+-----+-----+-----+------+-------------+
only showing top 5 rows



In [0]:
# time to build vectors for each review
assembler = VectorAssembler(inputCols=features.columns, outputCol="features_vector")
#assembler.transform(df).show(5)
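
To make the assembler step concrete: for each row, it concatenates the chosen input columns into a single numeric vector, with booleans becoming 1.0 or 0.0. Below is a plain-Python sketch of that idea, using a hypothetical `assemble` helper, a shortened column list, and a made-up row.

```python
def assemble(row, input_cols):
    """Concatenate the selected columns of a row (a dict) into one list of floats."""
    return [float(row[name]) for name in input_cols]

feature_cols = ["love", "great", "good", "fun"]  # shortened list, for illustration
row = {"overall": 0, "love": False, "great": True, "good": True, "fun": True}

features_vector = assemble(row, feature_cols)
print(features_vector)  # [0.0, 1.0, 1.0, 1.0]
```

The label column ('overall') is deliberately left out of the vector, which is why the notebook builds the column list from `features` rather than from `df`.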

In [0]:
# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = df.randomSplit([TRAINING_DATA_RATIO, 1 - TRAINING_DATA_RATIO])

# Train a Naive-Bayes model.
nb = NaiveBayes(featuresCol="features_vector", labelCol="overall", modelType="bernoulli")

# Chain the assembler and the Naive Bayes classifier in a Pipeline
pipeline = Pipeline(stages=[assembler, nb])
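
Note that `RANDOM_SEED` is defined at the top of the notebook but is not passed to the split, so each run produces a different partition. `randomSplit` accepts an optional `seed` argument, e.g. `df.randomSplit([...], seed=RANDOM_SEED)`. The plain-Python sketch below illustrates why a seed matters. `split_rows` is a hypothetical helper that assigns each row to train or test with a seeded coin flip, which is roughly how I understand the Spark split to behave (the resulting sizes are approximate, not exact).

```python
import random

RANDOM_SEED = 141107
TRAINING_DATA_RATIO = 0.8

def split_rows(rows, train_ratio, seed):
    """Assign each row to train or test with an independent seeded coin flip."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for review rows
train_a, test_a = split_rows(rows, TRAINING_DATA_RATIO, RANDOM_SEED)
train_b, test_b = split_rows(rows, TRAINING_DATA_RATIO, RANDOM_SEED)

print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```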

In [0]:
# build model using the training data
model = pipeline.fit(trainingData)
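
For intuition about what the Bernoulli model fits here, below is a small from-scratch sketch in plain Python. It is not Spark's implementation: `train_bernoulli_nb` and `predict` are hypothetical helpers, the four-row dataset is made up, and add-one (Laplace) smoothing is assumed. Each class gets a prior and, for every word flag, a probability that the flag is on. A review is scored by summing log-probabilities over both the present and the absent flags.

```python
import math

def train_bernoulli_nb(X, y, smoothing=1.0):
    """Estimate log priors and per-feature 'flag is on' probabilities per class."""
    n_features = len(X[0])
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        prior = len(rows) / len(X)
        # Smoothed P(feature = 1 | class); 2 * smoothing covers the on/off outcomes.
        theta = [
            (sum(r[j] for r in rows) + smoothing) / (len(rows) + 2 * smoothing)
            for j in range(n_features)
        ]
        model[label] = (math.log(prior), theta)
    return model

def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, (log_prior, theta) in model.items():
        score = log_prior
        for value, p in zip(x, theta):
            # Absent flags contribute too, via log(1 - p).
            score += math.log(p) if value else math.log(1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up data: columns are flags for ['great', 'fun', 'bad', 'waste']
X = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
y = [1, 1, 0, 0]

model = train_bernoulli_nb(X, y)
print(predict(model, [1, 0, 0, 0]))
print(predict(model, [0, 0, 0, 1]))
```

Because a review with none of the 12 flags set carries almost no evidence, its prediction is driven mostly by the class prior, which is one reason a tiny dictionary separates the classes poorly.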

In [0]:
# make predictions with the test data
predictions = model.transform(testData)
#predictions.show(5)

In [43]:
# Evaluate the predictions with the area under the ROC curve (not classification accuracy)
evaluator = BinaryClassificationEvaluator(
    labelCol="overall", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Area under ROC = {auc:g}")

Area under ROC = 0.522542
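
Since the metric here is the area under the ROC curve rather than accuracy, it helps to recall what it measures: the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one, with ties counted as half. A scorer that cannot separate the classes lands at 0.5. The sketch below computes this by brute force over all pairs. `pairwise_auc` is a hypothetical helper and the scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
print(pairwise_auc(labels, [0.9, 0.8, 0.3, 0.1]))  # 1.0, perfect ranking
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5, no separation
print(pairwise_auc(labels, [0.9, 0.2, 0.8, 0.1]))  # 0.75
```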


That is awful: an area under the ROC curve of about 0.52 is barely better than random guessing (0.5). But this is to be expected, since we are using a very simple dictionary of 12 words to try to predict sentiment. I will want to expand this analysis by deploying additional tools like word2vec.