# W266 Final Project Code
# Amazon Product Review Aspect-Based Sentiment
## Jennifer Mahle and Joanna Wang (Sections 3 and 1, respectively) 

#### Introduction
For our final project, we built a classification system for Amazon product reviews. The system first categorizes each review by the product trait it focuses on (e.g., durability or quality), then determines whether the review is positive or negative about that trait. Star ratings alone often do not tell a shopper enough about a product, so reading the reviews is still the best way to judge whether the product fits their needs. However, a product can have hundreds of reviews, and users cannot spend the time to read them all. Our classification system aims to shorten the review-reading process and help users find the information they need.
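
To make the intended output concrete, here is a minimal sketch of how per-review results could be rolled up into a per-trait summary for one product. The `AspectLabel` record, the `summarize` helper, and the sample values are hypothetical illustrations and are not part of the project code.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class AspectLabel:
    """One classified review: the trait it discusses and its polarity."""
    aspect: str     # e.g. "durability" or "quality"
    sentiment: str  # "positive" or "negative"


def summarize(labels):
    """Count positive and negative reviews for each trait."""
    summary = defaultdict(Counter)
    for label in labels:
        summary[label.aspect][label.sentiment] += 1
    return {aspect: dict(counts) for aspect, counts in summary.items()}


# Made-up classifier outputs for a single product (assumption).
labels = [
    AspectLabel("durability", "negative"),
    AspectLabel("durability", "negative"),
    AspectLabel("quality", "positive"),
]
summary = summarize(labels)
print(summary)
```

A shopper could then scan a handful of trait counts instead of reading every review.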


### Exploratory Data Analysis

In this section, we load, clean, and explore the data. We use the Electronics category of the Amazon product review dataset (the `Electronics_5.json` file), available at https://nijianmo.github.io/amazon/index.html

In [2]:
# Import libraries
import warnings
warnings.filterwarnings("ignore")

import re
import string

import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

import os
import sys

In [5]:
from keras.models import Sequential
# Embedding is imported from keras.layers, which works across Keras versions
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.preprocessing import sequence
# fix random seed for reproducibility
np.random.seed(7)

In [4]:
!pip3 install --user keras


In [None]:
# OPTIONAL: this cell builds df_mini.csv from the full dataset.
# Skip it and run the next cell to load the saved subset.
#####################################################################
dataset = "Electronics_5.json"

if os.path.isfile(dataset):
    df = pd.read_json(dataset, lines=True)
else:
    url = r"http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz"
    df = pd.read_json(url, compression='gzip', lines=True)

display(df.tail(10))
print(df.shape)
df.info()  # prints its own summary and returns None
# Keep only the reviews for a small set of product IDs (ASINs)
mini_asins = ["B01HJCN1EI", "B01HJH42KU", "B01HJH40WU", "B01HJF704M", "B01HJCN5GC",
              "B01HJCN5TO", "B01HJDNL60", "B01HJDR9DQ", "B01HJFFHTC"]
df_mini = df[df.asin.isin(mini_asins)]
print(df_mini.shape)
df_mini.to_csv('/home/wangjia/datasci-w266-finalProject/df_mini.csv')
######################################################################

In [None]:
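df = pd.read_csv("df_mini.csv")
display(df.tail(10))
print(df.shape)  # a bare df.shape is only displayed when it is the last line of a cell
df.info()  # prints its own summary and returns None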

In [None]:
#Remove NA review rows
df = df.dropna(subset=['reviewText'])
#Checking one of the reviews
print(df["reviewText"].iloc[100])

In [None]:
# Downloading stopwords
nltk.download('stopwords')

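# Set of English stopwords; 'not' is kept because negation matters for sentiment
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
words_to_keep = {'not'}  # set(('not')) would split the string into {'n', 'o', 't'}
stop -= words_to_keep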
#initialising the snowball stemmer
sno = nltk.stem.SnowballStemmer('english')

# Replace HTML tags with a space
def cleanhtml(sentence):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext

# Strip punctuation and special characters
def cleanpunc(sentence):
    # Inside [...] a "|" is a literal character, not alternation, so it is listed once
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)
    cleaned = re.sub(r'[.,()/]', r' ', cleaned)
    return cleaned
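
As a quick check of what the two helpers do, the sketch below applies them to a made-up review string. The helpers are re-declared in compact form so the block runs on its own; the sample text is an assumption, not a review from the dataset.

```python
import re


def cleanhtml(sentence):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r'<.*?>', ' ', sentence)


def cleanpunc(sentence):
    """Delete ? ! ' " # | and turn . , ( ) / into spaces."""
    cleaned = re.sub(r'[?!\'"#|]', '', sentence)
    return re.sub(r'[.,()/]', ' ', cleaned)


# Made-up review text containing an HTML line break.
sample = 'Great cable!<br />Works with my TV, no issues.'
tokens = cleanpunc(cleanhtml(sample)).split()
print(tokens)
# ['Great', 'cable', 'Works', 'with', 'my', 'TV', 'no', 'issues']
```

The tag is replaced with a space rather than deleted, so "cable!" and "Works" do not fuse into one token.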

In [None]:
# Clean each review: strip HTML tags and punctuation, keep alphabetic words
# longer than 2 characters, drop stopwords, lowercase, and stem.

i = 0
str1 = ' '
final_string = []
all_positive_words = []  # store words from +ve reviews here
all_negative_words = []  # store words from -ve reviews here
s = ''
for sent in df.reviewText:
    filtered_sentence = []
    sent = cleanhtml(sent)  # remove HTML tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if cleaned_words.lower() not in stop:
                    s = (sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    # NOTE: reviewText holds the review string, so these two
                    # comparisons are never true and both word lists stay empty.
                    # A binary sentiment label column would be needed here instead.
                    if (df['reviewText'].values)[i] == 1:
                        all_positive_words.append(s)  # words used in positive reviews
                    if (df['reviewText'].values)[i] == 0:
                        all_negative_words.append(s)  # words used in negative reviews

    str1 = b" ".join(filtered_sentence)  # final string of cleaned words
    final_string.append(str1)
    i += 1

In [None]:
#adding a column of CleanedText which displays the data after pre-processing of the review
df['CleanedText']=final_string  
df['CleanedText']=df['CleanedText'].str.decode("utf-8")
#below the processed review can be seen in the CleanedText Column 
print('Shape of final',df.shape)
df.head()

In [None]:
#After processing the sample review looks like this
print(df["CleanedText"].iloc[100])

In [None]:
# Sort by product ID (asin) in ascending order
sorted_data = df.sort_values('asin', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

# Drop duplicate reviews (subset is passed as a list of column names)
final = sorted_data.drop_duplicates(subset=["reviewerID", "reviewerName", "reviewText", "summary"], keep='first', inplace=False)

# Keep only verified reviews
final = final[final.verified != False]

# Drop rows with missing review text and name the result x_train
x_train = final.dropna(subset=['reviewText'])
print(x_train.shape)

### Text Encoding using Universal Sentence Encoder

In the following code cells, we load the Universal Sentence Encoder (USE), split the data into training and test sets, and apply the USE to the review text.
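
One way to carry out the split is a simple random hold-out over row positions. The sketch below is only an illustration: the 80/20 ratio, the seed, and the helper name `split_indices` are assumptions, not values taken from the project.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=7):
    """Return (train_idx, test_idx): disjoint, shuffled row positions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

With a pandas DataFrame, the two position lists could be passed to `.iloc` to select the training and test rows.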

In [None]:
# -y answers the confirmation prompt, which would otherwise block the notebook cell
!pip3 uninstall -y tensorflow-gpu
!pip3 uninstall -y tensorflow

In [None]:
# Run this cell the first time to install the necessary packages

##%%capture
# Install the Tensorflow 2.0.0 version.
!pip3 install --user tensorflow==2.0.0
# Install TF-Hub.
!pip3 install --user tensorflow-hub
!pip3 install --user seaborn


In [None]:
#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print("module %s loaded" % module_url)
def embed(texts):
    # texts: a list of strings; returns one embedding vector per string
    return model(texts)

In [None]:
# Create embeddings for the cleaned review text
logging.set_verbosity(logging.ERROR)
# Pass a plain list of strings rather than a pandas Series
message_embeddings = embed(x_train.CleanedText.tolist())
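
Passing every review to the encoder in a single call can use a lot of memory when the input is large. A possible workaround, sketched here with a hypothetical `batched` helper and a stand-in `fake_embed` function in place of the real encoder, is to encode the texts in fixed-size batches and collect the results.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    # Stand-in for the real encoder (assumption): one tiny 2-number
    # "vector" per text, built from its character and word counts.
    return [[float(len(t)), float(t.count(' ') + 1)] for t in texts]


# Made-up cleaned reviews.
reviews = ["great cabl", "stop work", "not worth price", "fast ship", "love"]
vectors = []
for batch in batched(reviews, batch_size=2):
    vectors.extend(fake_embed(batch))

print(len(vectors))  # 5
```

The batch size is a tuning knob: larger batches mean fewer calls, smaller batches mean a lower peak memory footprint.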

In [None]:
print("Training X Shape", x_train.shape)

In [None]:
x_train.head()

In [None]:
message_embeddings[10]

## Model creation

In [None]:
from keras.models import Sequential
# Embedding is imported from keras.layers, which works across Keras versions
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.preprocessing import sequence
# fix random seed for reproducibility
np.random.seed(7)

In [None]:
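# Create the model
embedding_vector_length = 32
# NOTE: max_review_length is not defined in the cells above; it must be set
# to the padded input sequence length before this cell can run.

# Initialise the model
model_1 = Sequential()

# Add the embedding layer
# NOTE: the first argument of Embedding is the vocabulary size (input_dim);
# len(message_embeddings) is the number of reviews.
model_1.add(Embedding(len(message_embeddings), embedding_vector_length, input_length=max_review_length))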

## Stanford POS Tagger to Find Product Attributes

We use the Stanford POS tagger to find the most common nouns used in product reviews for each product ID (ASIN). Then we use the most common nouns as product attributes. 
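
The counting step can be sketched without the tagger itself. The example below assumes the reviews have already been tagged and uses made-up ASINs and tokens; it keeps tokens tagged `NN` or `NNS` and tallies them separately for each product.

```python
from collections import Counter, defaultdict

# Hypothetical, pre-tagged reviews: (asin, [(word, POS tag), ...]).
tagged_reviews = [
    ("A1", [("battery", "NN"), ("lasts", "VBZ"), ("long", "RB")]),
    ("A1", [("battery", "NN"), ("and", "CC"), ("cables", "NNS")]),
    ("B2", [("screen", "NN"), ("is", "VBZ"), ("bright", "JJ")]),
]

# One Counter of nouns per product ID.
noun_counts = defaultdict(Counter)
for asin, tagged in tagged_reviews:
    for word, tag in tagged:
        if tag in ("NN", "NNS"):  # singular and plural common nouns
            noun_counts[asin][word.lower()] += 1

# The most frequent nouns per product serve as candidate attributes.
top_nouns = {asin: counts.most_common(2) for asin, counts in noun_counts.items()}
print(top_nouns)
# {'A1': [('battery', 2), ('cables', 1)], 'B2': [('screen', 1)]}
```

Keying the counts by ASIN keeps each product's attributes separate, so a frequent noun for one product does not crowd out another product's.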

In [None]:
#!python -m pip install --upgrade pip
#!pip install torch
#!pip install stanfordnlp

In [None]:
# Java must be installed (unless it already is);
# update the path to wherever it is stored on your computer
import os
java_path = "C:/Program Files/Java/jre1.8.0_241/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Follow the instructions here to install the Stanford POS tagger:
# https://phitchuria.wordpress.com/2018/09/29/python-nltk-using-stanford-pos-tagger-in-nltk-on-windows/
from nltk.tag import StanfordPOSTagger
from nltk.corpus import stopwords
# Raw strings keep the backslashes in Windows paths literal
stanford_dir = r"C:\Stanford\stanford-postagger-2018-10-16"
modelfile = stanford_dir + r"\models\english-bidirectional-distsim.tagger"
jarfile = stanford_dir + r"\stanford-postagger.jar"

tagger=StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)

In [None]:
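freq_dist = {}
# Iterate over the reviews directly: after sorting and filtering, the index is
# no longer 0..n-1, so label lookups such as reviewText[i] can raise a KeyError.
for review in x_train.reviewText:
    tagged_POS = tagger.tag(review.split())
    for word, tag in tagged_POS:
        # Count singular (NN) and plural (NNS) common nouns
        if tag in ('NN', 'NNS'):
            freq_dist[word] = freq_dist.get(word, 0) + 1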


In [None]:
import operator
# Sort the (word, count) pairs by count, in ascending order
sorted_freq_dist = sorted(freq_dist.items(), key=operator.itemgetter(1))
# Convert back to a dict, which keeps the sorted order and is easier to look up
dict_sorted_freq_dist = dict(sorted_freq_dist)

print(dict_sorted_freq_dist)

In [None]:
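# freq_dist is keyed by word, so freq_dist[0] raises a KeyError;
# the last entry of the ascending sorted list is the most frequent noun
print(sorted_freq_dist[-1])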