# W266 Final Project Code
# Amazon Product Review Aspect-Based Sentiment
## Jennifer Mahle and Joanna Wang (Sections 3 and 1, respectively) 

#### Introduction
For our final project, we built a classification system for Amazon product reviews. The system first categorizes each review by the product trait it focuses on (e.g., durability or quality), then determines whether the review is positive or negative about that trait. Star ratings alone often do not tell a shopper enough about a product, so reading the reviews is still the best way to decide whether the product fits their needs. The challenge is that a product can have hundreds of reviews, and users cannot spend the time to read all of them. Our classification system is meant to shorten the review-reading process and help users find what they need.


### Exploratory Data Analysis

In this section, we load, clean, and explore the data. We use the Amazon electronics product reviews available at https://nijianmo.github.io/amazon/index.html

In [None]:
import warnings
import os
import pandas as pd

warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", DeprecationWarning)

In [None]:
dataset = "Electronics_5.json"
# The file stores one JSON object per line, hence lines=True
df = pd.read_json(dataset, lines=True)

#df = pd.read_csv("mini_x_train.csv") 

display(df.tail())

In [None]:
print(df.info())

In [None]:
from datetime import datetime

condition = lambda row: datetime.fromtimestamp(row).strftime("%m-%d-%Y")
df["unixReviewTime"] = df["unixReviewTime"].apply(condition)
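
The conversion above relies on `datetime.fromtimestamp`, which uses the local time zone of the machine running the notebook, so the same timestamp can map to different calendar dates on different machines. A minimal sketch of a time-zone-independent variant follows, assuming `unixReviewTime` holds seconds since the epoch in UTC (an assumption on our part); `to_date_string` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def to_date_string(unix_seconds):
    """Format a Unix timestamp (assumed to be in UTC) as MM-DD-YYYY."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime("%m-%d-%Y")

# 1475971200 is midnight UTC on 9 October 2016
example = to_date_string(1475971200)
print(example)
```

With pandas, the helper could be passed to `df["unixReviewTime"].apply(to_date_string)` in place of the lambda.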

In [None]:
df.drop(labels="reviewTime", axis=1, inplace=True)

display(df.head())

In [None]:
print(df["reviewText"].iloc[0])

In [None]:
print(df.overall.unique())

In [None]:
sample_review = df["reviewText"].iloc[1689185]
print(sample_review)

In [None]:
import html

decoded_review = html.unescape(sample_review)
print(decoded_review)
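
The cell above decodes a single review. A sketch of applying the same decoding to several reviews at once, while leaving missing values untouched, might look like this; the sample strings and the `decode_review` helper are made up for illustration.

```python
import html

def decode_review(text):
    """Decode HTML entities in a review; non-string values (e.g. missing reviews) pass through."""
    return html.unescape(text) if isinstance(text, str) else text

# Made-up samples, including a missing review
samples = ["Works great &amp; looks good", "5&quot; cable, &lt;3 it", None]
decoded = [decode_review(s) for s in samples]
print(decoded)
```

On the DataFrame, the same helper could be passed to `df["reviewText"].apply(decode_review)`.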

In [None]:
print("Data Shape: ",df.shape)
no_NA_reviews = df.dropna(subset=['reviewText'])
print("Data Shape Dropping Observations with NA for the Review Text: ", no_NA_reviews.shape)

In [None]:
#split the data into training and testing data, using "overall" as the target variable
y=no_NA_reviews.overall
x=no_NA_reviews.drop('overall',axis=1)

In [None]:
from sklearn.model_selection import train_test_split

#x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
#mini_x_train,mini_x_test,mini_y_train,mini_y_test=train_test_split(x,y,test_size=0.9999)

#df_mini= no_NA_reviews.sample(n=600)
#df_mini.shape
#df_mini.to_csv(r'C:\Users\jrmah\Desktop\datasci-w266-finalProject\df_mini.csv')

# Keep only the reviews for a handful of products (ASINs).
# The original condition listed "B01HJCN1EI" twice; isin() needs each ASIN only once.
asin_list = ["B01HJCN1EI", "B01HJH42KU", "B01HJH40WU", "B01HJF704M",
             "B01HJCN5GC", "B01HJCN5TO", "B01HJDNL60", "B01HJDR9DQ",
             "B01HJFFHTC"]
asin_subset = no_NA_reviews[no_NA_reviews.asin.isin(asin_list)]


In [None]:
asin_subset.to_csv(r'C:\Users\jrmah\Desktop\datasci-w266-finalProject\asin_subset.csv')
#no_NA_reviews.tail()

In [2]:

#df_mini = pd.read_csv("df_mini.csv")
asin_subset = pd.read_csv("asin_subset.csv")
asin_subset.shape
asin_subset.head()

Unnamed: 0.1,Unnamed: 0,asin,image,overall,reviewText,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote
0,6099679,B01HJCN5GC,,5,Great buy!,ATGTQKPUR7XIO,Arthur,,Five Stars,10-09-2016,True,
1,6099680,B01HJCN5GC,,5,Works very well and we have lots (& lots) of e...,A15VV7NPTST593,Randy T.,,Extend your reach with ease,09-07-2016,True,
2,6099681,B01HJCN5GC,,5,This cable is very flexible. Just what I wanted.,AIM3MWK3Y7XOR,Kindle Customer,,Flexible cable,03-14-2017,True,
3,6099682,B01HJCN5GC,,5,"These are the best charging cables, and if oth...",A5W6EI03IKOLB,P.Davidson,,Best cables,02-15-2017,True,
4,6099683,B01HJCN5GC,,4,I bought this in rose gold or light pink and i...,A3QZTMHQ1XZ8PM,glittergirl,,super cute cord,02-13-2017,True,


In [3]:
#print(df_mini.shape)
#x_train = df_mini
x_train = asin_subset

### Text Encoding using Universal Sentence Encoder

In the following code cells, we load the Universal Sentence Encoder (USE) and use it to encode the review text. The train/test split is currently skipped: the whole ASIN subset serves as the training data.
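
For a small subset, a single call to the encoder is enough. For a larger portion of the dataset, embedding in batches may keep memory use bounded. The sketch below shows only the batching logic: `embed_in_batches` is a hypothetical helper, and `fake_embed` is a stand-in for the real encoder so that the example runs without TensorFlow.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Call embed_fn on consecutive slices of texts and collect one vector per text."""
    # Replace missing reviews with empty strings so the encoder only sees strings
    texts = ["" if t is None else str(t) for t in texts]
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(list(v) for v in embed_fn(batch))
    return vectors

def fake_embed(batch):
    """Stand-in for the real encoder: returns a tiny 2-dimensional 'embedding' per string."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]

reviews = ["Great buy!", "This cable is very flexible. Just what I wanted.", None]
vectors = embed_in_batches(reviews, fake_embed, batch_size=2)
print(len(vectors), vectors[0])
```

In the notebook, the `embed` function defined in the next cells would take the place of `fake_embed`.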

In [None]:
# -y skips the confirmation prompt, which a notebook cell cannot answer
!pip3 uninstall -y tensorflow-gpu
!pip3 uninstall -y tensorflow

In [None]:
# Run this cell the first time to install the necessary packages.
# To hide the installer output, uncomment %%capture and make it the first line of the cell.

#%%capture
# Install the Tensorflow 2.0.0 version.
!pip3 install tensorflow==2.0.0
# Install TF-Hub.
!pip3 install tensorflow-hub
!pip3 install seaborn


In [1]:
#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(texts):
    # Returns one embedding vector per input string
    return model(texts)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [4]:
# Create embeddings for the training data
logging.set_verbosity(logging.ERROR)
message_embeddings = embed(x_train.reviewText)

In [5]:
print("Training X Shape", x_train.shape)
#print("Testing X Shape", x_test.shape)

Training X Shape (151, 12)


In [6]:
x_train.head()

Unnamed: 0.1,Unnamed: 0,asin,image,overall,reviewText,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote
0,6099679,B01HJCN5GC,,5,Great buy!,ATGTQKPUR7XIO,Arthur,,Five Stars,10-09-2016,True,
1,6099680,B01HJCN5GC,,5,Works very well and we have lots (& lots) of e...,A15VV7NPTST593,Randy T.,,Extend your reach with ease,09-07-2016,True,
2,6099681,B01HJCN5GC,,5,This cable is very flexible. Just what I wanted.,AIM3MWK3Y7XOR,Kindle Customer,,Flexible cable,03-14-2017,True,
3,6099682,B01HJCN5GC,,5,"These are the best charging cables, and if oth...",A5W6EI03IKOLB,P.Davidson,,Best cables,02-15-2017,True,
4,6099683,B01HJCN5GC,,4,I bought this in rose gold or light pink and i...,A3QZTMHQ1XZ8PM,glittergirl,,super cute cord,02-13-2017,True,


In [12]:
message_embeddings[0]

<tf.Tensor: id=5310, shape=(512,), dtype=float32, numpy=
array([ 0.00710741, -0.09182499,  0.02211771,  0.02918162,  0.03248031,
       -0.00574995,  0.01550163,  0.01271848, -0.03912225, -0.04936445,
       -0.04106535, -0.03063094, -0.04944001,  0.07665452, -0.02052558,
        0.09214974,  0.02414197, -0.02064327,  0.05531691, -0.01879737,
       -0.05702349, -0.00586061,  0.01588081, -0.03555921, -0.01498751,
        0.00399226,  0.01141417, -0.0243075 , -0.03564594, -0.01301201,
        0.04049183, -0.02411711,  0.03397733,  0.0012069 ,  0.04055983,
        0.04646035,  0.02682724,  0.03999465, -0.00665502, -0.01069987,
       -0.01126873,  0.05030269, -0.02923088, -0.05813222, -0.03194809,
       -0.0402074 , -0.04014372, -0.04637567, -0.04457714,  0.07127845,
        0.06020061,  0.00120793,  0.04956914,  0.07571479,  0.02649621,
       -0.00352104, -0.01519211, -0.03678934,  0.07496415, -0.00524928,
        0.03508531, -0.00150366, -0.00683645,  0.00260571,  0.03070576,
       

## Stanford POS Tagger to Find Product Attributes

We use the Stanford POS tagger to find the most common nouns used in product reviews for each product ID (ASIN). Then we use the most common nouns as product attributes. 
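
The counting step described above can be sketched without the tagger itself. In the example below, hand-written `(word, tag)` pairs stand in for the tagger's output, and `count_nouns` is a hypothetical helper. Stripping punctuation and lowercasing each noun is an added normalization, not part of the original approach; it is there so that tokens such as "cable" and "Cable" are counted together.

```python
import string
from collections import Counter

# Penn Treebank tags for singular and plural common nouns
NOUN_TAGS = {"NN", "NNS"}

def count_nouns(tagged_reviews):
    """Count nouns across reviews, given one list of (word, tag) pairs per review."""
    counts = Counter()
    for tagged in tagged_reviews:
        for word, tag in tagged:
            if tag in NOUN_TAGS:
                # Added normalization: drop surrounding punctuation and lowercase
                cleaned = word.strip(string.punctuation).lower()
                if cleaned:
                    counts[cleaned] += 1
    return counts

# Hand-written stand-in for tagger output (not real tagger results)
tagged_reviews = [
    [("This", "DT"), ("cable", "NN"), ("is", "VBZ"), ("flexible.", "JJ")],
    [("Best", "JJS"), ("charging", "NN"), ("cables,", "NNS")],
    [("The", "DT"), ("Cable", "NN"), ("broke", "VBD")],
]
counts = count_nouns(tagged_reviews)
top_attributes = counts.most_common(2)
print(top_attributes)
```

`Counter.most_common(n)` returns the `n` most frequent nouns directly, which avoids sorting the whole dictionary by hand.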

In [None]:
#!python -m pip install --upgrade pip
#!pip install torch
#!pip install stanfordnlp

In [8]:
# Java must be installed (unless you already have it);
# update the path below to wherever it is stored on your computer
import os
java_path = "C:/Program Files/Java/jre1.8.0_241/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# need to follow instructions to install Stanford POS tagger here: 
# https://phitchuria.wordpress.com/2018/09/29/python-nltk-using-stanford-pos-tagger-in-nltk-on-windows/
from nltk.tag import StanfordPOSTagger
from nltk.corpus import stopwords
# Raw strings keep the Windows backslashes from being read as escape sequences
stanford_dir = r"C:\Stanford\stanford-postagger-2018-10-16"
modelfile = stanford_dir + r"\models\english-bidirectional-distsim.tagger"
jarfile = stanford_dir + r"\stanford-postagger.jar"

tagger=StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)

In [9]:
freq_dist = {}
# Start at 0 so the first review is counted as well (range(1, ...) skipped it)
for i in range(len(x_train)):
    tagged_POS = tagger.tag(x_train.reviewText[i].split())
    for word, tag in tagged_POS:
        # NN and NNS are the Penn Treebank tags for singular and plural common nouns
        if tag in ('NN', 'NNS'):
            freq_dist[word] = freq_dist.get(word, 0) + 1


In [10]:
import operator
# Sort the (word, count) pairs by count, lowest first
sorted_freq_dist = sorted(freq_dist.items(), key=operator.itemgetter(1))
# Convert back to a dictionary, which keeps the sorted order and allows lookup by word
dict_sorted_freq_dist = dict(sorted_freq_dist)

print(dict_sorted_freq_dist)

{'lots': 1, 'extension.': 1, 'family': 1, 'members': 1, 'rose': 1, 'gold': 1, 'bubblegum': 1, 'issues': 1, 'pop': 1, 'staying': 1, 'cabel.': 1, 'cording': 1, 'wires': 1, 'breaking,': 1, 'too.': 1, 'purchased!': 1, "Can't": 1, 'bump': 1, 'stiff': 1, 'replacement.': 1, 'better.': 1, 'either.': 1, 'green': 1, 'ladies': 1, 'places.': 1, 'fear': 1, 'bend': 1, 'bleep': 1, 'connection.': 1, '(hear)': 1, "you'": 1, '(Kindle)': 1, 'a.m.': 1, 'morning.': 1, 'thanks': 1, 'fits': 1, 'tablet': 1, 'battery.': 1, 'room': 1, 'soft,': 1, 'deform....bought': 1, 'case....I': 1, 'reviews': 1, 'pay': 1, 'attention....one': 1, 'functioning.....the': 1, 'position...it': 1, 'fire....I': 1, 'package': 1, 'green,': 1, 'charger!': 1, 'Camera,': 1, 'Im': 1, 'GoPro.': 1, '1080P': 1, 'HDR,': 1, 'par': 1, 'resolution,': 1, 'video.': 1, 'there,': 1, 'operation.': 1, 'NT96650': 1, 'AR0330': 1, 'CMOS': 1, 'sizes': 1, 'monitor': 1, 'systems.': 1, 'R2,': 1, 'Mini': 1, 'look': 1, 'clips': 1, 'sure.': 1, 'Folder': 1, 'fold

In [13]:
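# freq_dist is keyed by word, not by position, so freq_dist[0] raises KeyError: 0.
# Index the sorted list of (word, count) pairs instead; the last entry is the most frequent noun.
print(sorted_freq_dist[-1])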