# W266 Final Project Code
# Amazon Product Review Aspect-Based Sentiment
## Jennifer Mahle and Joanna Wang (Sections 3 and 1, respectively) 

### Introduction
For our final project, we built a classification system for Amazon product reviews. The system first categorizes each review by the product trait it focuses on (e.g., durability or quality), then determines whether the review is positive or negative about that trait. Star ratings alone often do not tell a shopper enough about a product, so reading the reviews is still the best way to decide whether the product fits their needs. The challenge is that a product can have hundreds of reviews, and users cannot spend the time to read all of them. Our classification system is meant to shorten the review-reading process and help users find what they need.


### Exploratory Data Analysis

In this section, we load, clean, and explore the data. We use the Amazon product reviews for electronics available at https://nijianmo.github.io/amazon/index.html

In [11]:
import warnings

warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", DeprecationWarning)

In [12]:
import os
import pandas as pd

#dataset = "Electronics_5.json"
#df = pd.read_json("Electronics_5.json", lines=True)

df = pd.read_csv("mini_x_train.csv") 

display(df.tail(10))

Unnamed: 0,asin,image,overall,reviewText,reviewTime,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote
6739580,B01HJCN1EI,,5,These are my favorite charging cords for a few...,"07 25, 2017",A1OOVLE2KZ6KGA,Puddzee,,Worth the price.,1500940800,True,
6739581,B01HJCN1EI,,1,"Update....after 2 months of gentle use, cable ...","04 4, 2017",A77K1B31UAQ29,addictedtoreading,,UPDATE...BREAKS AND SLOW CHARGING,1491264000,True,
6739582,B01HJH42KU,,3,These are okay. The connection becomes very if...,"07 8, 2017",A2SVXUVUAWUDK2,Andrew,,Hope this makes sense. You'd understand if you...,1499472000,True,
6739583,B01HJH42KU,,2,I liked the length and the product at first bu...,"05 21, 2017",A12E1JGKV0ETAB,John Adams,,Lost ability to connect.,1495324800,True,
6739584,B01HJH40WU,,3,not holding up over time :(,"06 26, 2017",A1HKXEX8BEQC2E,Dasha stephens,,not holding up over time :(,1498435200,True,
6739585,B01HJH40WU,,4,"These seem like quality USB cables, time will ...","03 21, 2017",A33MAQA919J2V8,Kurt Wurm,,Four Stars,1490054400,True,
6739586,B01HJH40WU,,4,"Works great, love the longer cord. As with any...","01 9, 2017",A1AKHSCPD1BHM4,C.L Momof3,,Nice long cord,1483920000,True,
6739587,B01HJH40WU,,5,"Ok here is an odd thing that happened to me, I...","12 1, 2016",A2HUZO7MQAY5I2,michael clontz,,Not the correct product as linked in the sale.,1480550400,True,2.0
6739588,B01HJH40WU,,5,Works well.,"11 29, 2016",AJJ7VX2L91X2W,Faith,,Five Stars,1480377600,True,2.0
6739589,B01HJF704M,,5,I have it plugged into a usb extension on my g...,"03 31, 2017",A1FGCIRPRNZWD5,Brando,,Works well enough..,1490918400,True,


In [13]:
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6739590 entries, 0 to 6739589
Data columns (total 12 columns):
asin              object
image             object
overall           int64
reviewText        object
reviewTime        object
reviewerID        object
reviewerName      object
style             object
summary           object
unixReviewTime    int64
verified          bool
vote              object
dtypes: bool(1), int64(2), object(9)
memory usage: 572.0+ MB
None


In [14]:
from datetime import datetime

condition = lambda row: datetime.fromtimestamp(row).strftime("%m-%d-%Y")
df["unixReviewTime"] = df["unixReviewTime"].apply(condition)
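
`datetime.fromtimestamp` converts with the local time zone of the machine running the notebook, so the same `unixReviewTime` can land on different calendar days on different machines. A minimal sketch of a time-zone-independent variant, assuming the timestamps are seconds since the epoch in UTC (the helper name `to_utc_date` is ours, not part of the dataset tooling):

```python
from datetime import datetime, timezone

def to_utc_date(unix_seconds):
    """Format a Unix timestamp (in seconds) as MM-DD-YYYY, interpreted in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime("%m-%d-%Y")

# Timestamp taken from the first row of the df.tail(10) output shown earlier.
print(to_utc_date(1500940800))  # 07-25-2017
```

It could be applied the same way as the lambda above, e.g. `df["unixReviewTime"].apply(to_utc_date)`.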

In [15]:
df.drop(labels="reviewTime", axis=1, inplace=True)

display(df.head())

Unnamed: 0,asin,image,overall,reviewText,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote
0,151004714,,5,This is the best novel I have read in 2 or 3 y...,AAP7PPBU72QFM,D. C. Carrad,{'Format:': ' Hardcover'},A star is born,09-17-1999,True,67
1,151004714,,3,"Pages and pages of introspection, in the style...",A2E168DTVGE6SV,Evy,{'Format:': ' Kindle Edition'},A stream of consciousness novel,10-22-2013,True,5
2,151004714,,5,This is the kind of novel to read when you hav...,A1ER5AYS3FQ9O3,Kcorn,{'Format:': ' Paperback'},I'm a huge fan of the author and this one did ...,09-01-2008,False,4
3,151004714,,5,What gorgeous language! What an incredible wri...,A1T17LMQABMBN5,Caf Girl Writes,{'Format:': ' Hardcover'},The most beautiful book I have ever read!,09-03-2000,False,13
4,151004714,,3,I was taken in by reviews that compared this b...,A3QHJ0FXK33OBE,W. Shane Schmidt,{'Format:': ' Hardcover'},A dissenting view--In part.,02-03-2000,True,8


In [5]:
print(df["reviewText"].iloc[0])

This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers' workshop "my parents were  mean to me and then my professors were mean to me" trivia look  childish and silly by comparison, as they are.
Anyone who says this is an  adolescent girl's coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.
I was particularly impressed with  this young author's grasp of the meaning and texture of the lost world of  French Algeria in the 1950's and '60's...particularly poignant when read in  1999 from another ruined and abandoned French colony, amid the decaying  buildings of Phnom Penh...
I hope the author will write many more books  and that her publishers will bring her first novel ba

In [6]:
print(df.overall.unique())

[5 3 4 2 1]


In [7]:
sample_review = df["reviewText"].iloc[1689185]
print(sample_review)

Del sistema digital anexo pros y contras
- Muy bueno porque reduce la necesidad de instalaciones
- Opera con varios equipos
- El tamao es adecuado

- Si hay variacion de voltaje la velocidad se ve mermada
- El cable de alimentacion es muy corto (que se puede conseguir)

En terminos generales estoy satisfecho


In [8]:
import html

decoded_review = html.unescape(sample_review)
print(decoded_review)

Del sistema digital anexo pros y contras
- Muy bueno porque reduce la necesidad de instalaciones
- Opera con varios equipos
- El tamao es adecuado

- Si hay variacion de voltaje la velocidad se ve mermada
- El cable de alimentacion es muy corto (que se puede conseguir)

En terminos generales estoy satisfecho
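
The sample review above contains no HTML entities, so `html.unescape` returns it unchanged. A small illustration of what the call does on text that does contain entities (the example string is made up for this sketch, not taken from the dataset):

```python
import html

# Hypothetical review text containing HTML entities.
raw = "Great cable &amp; charger, 6&#39; long &lt;3"
decoded = html.unescape(raw)
print(decoded)  # Great cable & charger, 6' long <3
```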


In [17]:
print("Data Shape: ",df.shape)
no_NA_reviews = df.dropna(subset=['reviewText'])
print("Data Shape Dropping Observations with NA for the Review Text: ", no_NA_reviews.shape)

Data Shape:  (6739590, 11)
Data Shape Dropping Observations with NA for the Review Text:  (6738237, 11)


### Text Encoding using Universal Sentence Encoder

In the following code cells, we load the Universal Sentence Encoder (USE), split the data into training and test sets, and apply the USE to the training reviews.

In [None]:
# Remove the leading ## from the lines below and run them once to install the necessary packages

##%%capture
# Install the latest Tensorflow version.
##!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
##!pip3 install tensorflow-hub
##!pip3 install seaborn


In [20]:
#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
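print("module %s loaded" % module_url)

def embed(texts):
  # Return one USE embedding vector per input string.
  # (parameter renamed from `input`, which shadows the Python builtin)
  return model(texts)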

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [18]:
# separate the features (x) from the target variable, "overall" (y)
y=no_NA_reviews.overall
x=no_NA_reviews.drop('overall',axis=1)


In [23]:
from sklearn.model_selection import train_test_split

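#x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
# test_size=0.9999 keeps only about 0.01% of the rows (673 reviews) for training
mini_x_train,mini_x_test,mini_y_train,mini_y_test=train_test_split(x,y,test_size=0.9999)

# NOTE: this writes the full DataFrame `df`, not the `mini_x_train` subset, despite the file name
df.to_csv(r'C:\Users\jrmah\Desktop\datasci-w266-finalProject\mini_x_train.csv')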

In [None]:
#create embeddings on the training data 
logging.set_verbosity(logging.ERROR)
#message_embeddings = embed(x_train.reviewText)
message_embeddings = embed(mini_x_train.reviewText)
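
Embedding every review in a single call may not fit in memory once the training set grows beyond the small sample used here. A minimal sketch of batching the calls, assuming eager execution (TensorFlow 2) when used with the real model; `embed_in_batches` and `fake_embed` are hypothetical helpers introduced for illustration, and `fake_embed` only stands in for the USE model so the sketch runs without TensorFlow:

```python
def embed_in_batches(texts, embed_fn, batch_size=256):
    """Apply embed_fn to texts in fixed-size batches and collect one row per text."""
    texts = list(texts)
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        rows.extend(embed_fn(batch))
    return rows

def fake_embed(batch):
    # Stand-in for the USE model: returns a 1-dimensional "embedding" per text.
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["good", "bad cable", "ok"], fake_embed, batch_size=2)
print(len(vectors))  # 3
```

With the model loaded above, a call might look like `embed_in_batches(mini_x_train.reviewText, lambda b: embed(b).numpy())`; the batch size is a tuning choice, not a value from this project.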

In [29]:
print("Training X Shape", mini_x_train.shape)
print("Testing X Shape", mini_x_test.shape)

Training X Shape (673, 10)
Testing X Shape (6737564, 10)
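
The split above is random and unseeded, so the 673 training reviews change on every run and their star-rating mix can drift from the full dataset. A hedged sketch of a reproducible, rating-balanced alternative using `train_test_split`'s `stratify` and `random_state` parameters; the toy data below is made up, and this is not the split used for the results in this notebook:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the reviews: 100 texts with 20 of each star rating.
ratings = [1, 2, 3, 4, 5] * 20
texts = ["review %d" % i for i in range(len(ratings))]

small_x, rest_x, small_y, rest_y = train_test_split(
    texts, ratings,
    train_size=0.2,      # keep 20% here; the notebook keeps roughly 0.01%
    stratify=ratings,    # preserve the star-rating proportions in both parts
    random_state=0,      # make the sample reproducible across runs
)
print(len(small_x), Counter(small_y))
```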


In [32]:
mini_x_train.head()

Unnamed: 0,asin,image,reviewText,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote
5311304,B00Z6XGX98,,"OMG, everyone that listens to music through ea...",A3EA01H53H6KT6,John Henson,,I love this brand the most,03-28-2015,True,
1379848,B0023RRCP4,,"Didn't help at all, returned for refund",A16HCPQ51L8IUR,John Gruszynski,,Two Stars,11-15-2014,True,
5447345,B0131PBN6U,,Great monitor for the price. Paired with my rx...,A1WCDPSDPJOSVF,Max,"{'Size:': ' 24""', 'Style:': ' With Base Stand'}",Five Stars,07-04-2016,True,
936446,B000Y9TZ9Y,,Great Product,A2WSD8KNLRE34X,David W. Wollenschlager,{'Package Type:': ' Frustration-Free Packaging'},Five Stars,09-27-2015,True,
1786054,B0047K0036,,"Worth every penny, If you don't like bright t...",A2WHJW3APWPJ2B,sky,{'Color:': ' black'},If you don't like bright tweet clear mid and l...,07-25-2016,False,2.0


In [17]:
message_embeddings[0]

<tf.Tensor 'strided_slice_1:0' shape=(512,) dtype=float32>

In [None]:
#!python -m pip install --upgrade pip
#!pip install torch
#!pip install stanfordnlp

In [35]:
# Java must be installed (unless you already have it);
# update the path below to wherever java.exe is stored on your computer
import os
java_path = "C:/Program Files/Java/jre1.8.0_241/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# follow the instructions here to install the Stanford POS tagger:
# https://phitchuria.wordpress.com/2018/09/29/python-nltk-using-stanford-pos-tagger-in-nltk-on-windows/
from nltk.tag import StanfordPOSTagger
# raw string so the Windows backslashes are not read as escape sequences
stanford_dir = r"C:\Stanford\stanford-postagger-2018-10-16"
modelfile = os.path.join(stanford_dir, "models", "english-bidirectional-distsim.tagger")
jarfile = os.path.join(stanford_dir, "stanford-postagger.jar")

tagger = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)
# tag() takes one sentence as a list of tokens, so tag the first review (split on whitespace) as an example
tagged_POS = tagger.tag(str(mini_x_train.reviewText.iloc[0]).split())


In [36]:
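# tag_sents() expects a list of token lists; given raw strings, NLTK treats every character as a token
# (the call with raw strings, tagger.tag_sents(mini_x_train.reviewText), produced the Java heap error recorded below)
tagged_sent_POS = tagger.tag_sents([str(review).split() for review in mini_x_train.reviewText])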

Loading default properties from tagger C:\Stanford\stanford-postagger-2018-10-16\models\english-bidirectional-distsim.tagger
Loading POS tagger from C:\Stanford\stanford-postagger-2018-10-16\models\english-bidirectional-distsim.tagger ... done [1.9 sec].
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:87)
	at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:37)
	at edu.stanford.nlp.tagger.maxent.TestSentence.runTagInference(TestSentence.java:341)
	at edu.stanford.nlp.tagger.maxent.TestSentence.testTagInference(TestSentence.java:328)
	at edu.stanford.nlp.tagger.maxent.TestSentence.tagSentence(TestSentence.java:151)
	at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagSentence(MaxentTagger.java:1052)
	at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagCoreLabelsOrHasWords(MaxentTagger.java:1843)
	at edu.stanford.nlp.tagge

OSError: Java command failed : ['C:/Program Files/Java/jre1.8.0_241/bin/java.exe', '-mx1000m', '-cp', 'C:\\Stanford\\stanford-postagger-2018-10-16\\stanford-postagger.jar', 'edu.stanford.nlp.tagger.maxent.MaxentTagger', '-model', 'C:\\Stanford\\stanford-postagger-2018-10-16\\models\\english-bidirectional-distsim.tagger', '-textFile', 'C:\\Users\\jrmah\\AppData\\Local\\Temp\\tmpat7t7pda', '-tokenize', 'false', '-outputFormatOptions', 'keepEmptySentences', '-encoding', 'utf8']
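
The command line in the error shows the tagger running with the `-mx1000m` heap setting, and very long token sequences are a likely cause of the `OutOfMemoryError`. One way to reduce memory pressure is to tokenize each review, cap its length, and send the tagger small batches. A minimal sketch under those assumptions: `token_batches` is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and the batch size and token cap are arbitrary illustrative values.

```python
def token_batches(reviews, batch_size=50, max_tokens=200):
    """Yield batches of whitespace-tokenized reviews, truncating very long ones."""
    batch = []
    for review in reviews:
        tokens = str(review).split()[:max_tokens]
        if not tokens:
            continue  # skip empty reviews (note: this drops them from the output)
        batch.append(tokens)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(token_batches(["Works well.", "", "not holding up over time :("], batch_size=2))
print(batches)

# With the tagger from the cell above (not run here; needs Java and the Stanford jar).
# java_options is the StanfordPOSTagger constructor argument that sets the JVM heap size:
# tagger = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile, java_options="-mx4g")
# tagged = [sent for batch in token_batches(mini_x_train.reviewText) for sent in tagger.tag_sents(batch)]
```

Because empty reviews are skipped, the tagged output will not line up row for row with `mini_x_train` unless the same filter is applied to the DataFrame.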