# Hotel Review Sentiment Analysis Part 3: Prediction
## Adura ABIONA, PhD (UNSW)
### 4 May, 2017

## Introduction

This is **Part 3** of the **Hotel Review Sentiment Analysis** of Australian hotels in four major cities (Canberra, Sydney, Melbourne and Brisbane), based on reviewers' opinions (on a numerical scale of 1-5) from the [**TripAdvisor**](http://www.tripadvisor.com.au) website. This part focuses on **Prediction** of the review ratings.

In [1]:
import glob, os, string # os.chdir()
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib
import matplotlib.pyplot as plt #import matplotlib as mpl
matplotlib.style.use('ggplot')
%matplotlib inline 

from sklearn import preprocessing
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words submodule is not public in newer scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB  # only this classifier is used below
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, roc_auc_score
import time

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from collections import Counter
from nltk import word_tokenize
from nltk.util import bigrams, trigrams
sep = "~"
DataDir = "Datasets/"
#nltk.download() # download the english stopwords corpus and the punkt package and maybe the porter stemmer if not present

pd.set_option('display.max_columns', 36)
print(pd.__version__)

0.20.1


#### The block of code below reads the review details for the hotels from the 4 major cities in Australia into a dataframe.

In [2]:
review_feats = ['id', 'title','body','rati','value','locat','sleep','rooms','clean','servi','other']
citys = ['Canberra', 'Sydney', 'Melbourne', 'Brisbane']
frames = []  # one DataFrame per review file
for city in citys:
    citydir = os.path.join(os.getcwd(), DataDir + city)
    for file in glob.glob(os.path.join(citydir, "*-review.mcsv")):
        # glob already returns the full path, so no further join is needed
        frames.append(pd.read_csv(file, sep=sep, header=None, names=review_feats))
review_df = pd.concat(frames, ignore_index=True)  # DataFrame.append was removed in pandas 2.0

print(review_df.shape)
review_df.head()

(27868, 11)


| | id | title | body | rati | value | locat | sleep | rooms | clean | servi | other |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | review_478470647 | “Waste of money” | From the moment we walked into the Adobe,we kn... | 1 | 1.0 | | | 2.0 | | 1.0 | |
| 1 | review_476951438 | “Well appointed room” | On check in was a queue of 6 waiting with only... | 3 | | | | | | | |
| 2 | review_476646112 | “Forgotten property "vanished into thin air" a... | Review submitted on behalf of my wife who was ... | 2 | 4.0 | | | 4.0 | | 2.0 | |
| 3 | review_475716850 | “Everything you need and more” | This is a great hotel. Clean and really comfor... | 5 | | | | | | | |
| 4 | review_474490948 | “Super handy for shops, food and transport” | This is my go to accommodation in Canberra. Cl... | 5 | | | | | | | |


### A look at some of the reviews before processing

In [6]:
review_dfx = review_df[['rati', 'body']]
for idx in range(5):
    print(review_dfx.body[idx])
    print("++++++++++++++++++++++++++++++++++++++++\n") 

From the moment we walked into the Adobe,we knew it was a mistake. Two young women behind the counter looked at us like we were scum and never broken a smile. The rooms are small and out dated for the money paid. No secure parking. You park in a general car park next to a bus depot. We barely slept, one because the beds are hard and because we were concerned for our new car being parked out for all to get to.My husband and I were in town from Newcastle to see a very sick friend for the weekend. Staying here just made the whole weekend even more stressful. Just horrible. We will not be back.
++++++++++++++++++++++++++++++++++++++++

On check in was a queue of 6 waiting with only one person on reception- room well appointed- mine had no view - bed comfortable- shower wasn't great pressure - breakfast was great hollandaise sauce was perfect 
++++++++++++++++++++++++++++++++++++++++

Review submitted on behalf of my wife who was the staying guest on 29 March 2017: The room was comfortable 

### Processing the reviews: removing stopwords and punctuation, tokenizing and stemming

In [7]:
# Some words (e.g. no, not, more, most) are taken out of the standard NLTK stop-word list,
# because they can carry sentiment in the review dataset and should reach the classifier.
# nltk.download()  # fetch the English stopwords corpus and the punkt tokenizer if not present

custom_stopwords = set(stopwords.words('english') + ["n't", "'ve", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS) 
                 + ['canberra', 'sydney', 'melbourne', 'brisbane']) - set(('over', 'under', 'below', 'more', 
                    'most', 'no', 'not', 'only', 'such', 'few', 'so', 'too', 'very', 'just', 'any', 'once'))


def restring_tokens(token_list):
    return ' '.join(token_list)

def preprocess(dfSent):
    sentx = str(dfSent)
    for dg in string.digits:  sentx = sentx.replace(dg, " ") 
    for ch in string.punctuation:  sentx = sentx.replace(ch, " ") 
    sentx = sentx.strip().replace("\n", " ").replace("\r", " ")
    sentx = sentx.lower()    
    wordList = [word for word in sentx.split() if word not in custom_stopwords] # Given a list of words, remove any that are in a list of stop words.
    sentx = ' '.join(wordList)
    token_list = nltk.word_tokenize(sentx) 
    STEMMER = PorterStemmer()
    token_list = [STEMMER.stem(tok) for tok in token_list]
    return restring_tokens(token_list)

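# Earlier, alternative pipeline (not called anywhere below), ported to Python 3.
# NLTK_STOPWORDS and MORE_STOPWORDS were never defined in this notebook; the two
# definitions here are assumptions that make the functions runnable.
NLTK_STOPWORDS = set(stopwords.words('english'))
MORE_STOPWORDS = custom_stopwords

def remove_stopwords(s):
    token_list = nltk.word_tokenize(s)
    exclude_stopwords = lambda token : token not in NLTK_STOPWORDS
    return ' '.join(filter(exclude_stopwords, token_list))

def filter_out_more_stopwords(token_list):
    return filter(lambda tok : tok not in MORE_STOPWORDS, token_list)

def processTokens(s):
    s = s.translate(str.maketrans('', '', string.digits))  # Python 3 form of translate(None, ...)
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation))
    s = remove_stopwords(s)
    token_list = nltk.word_tokenize(s)
    token_list = filter_out_more_stopwords(token_list)
    STEMMER = PorterStemmer()
    token_list = [STEMMER.stem(tok) for tok in token_list]  # str tokens need no .decode()
    return restring_tokens(token_list)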
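
The cleaning steps in `preprocess` can be tried in isolation with the standard library only. The sketch below is an illustration, not the notebook's pipeline: the stop-word list is a hypothetical miniature one, stemming is left out, and the names `toy_stopwords`, `strip_table` and `clean` are made up for this example. It shows why negations are subtracted from the stop-word set before filtering.

```python
import string

# Hypothetical miniature stop-word list; the notebook builds a much larger one
# from NLTK and scikit-learn.
base_stopwords = {"the", "a", "was", "and", "not", "no", "very", "we"}
keep_for_sentiment = {"not", "no", "very"}
toy_stopwords = base_stopwords - keep_for_sentiment

# Python 3: str.translate takes a table built by str.maketrans.
# Every digit and punctuation character is mapped to a space.
strip_table = str.maketrans({ch: " " for ch in string.digits + string.punctuation})

def clean(text):
    """Lower-case, drop digits/punctuation, and remove the toy stop words."""
    words = text.translate(strip_table).lower().split()
    return " ".join(w for w in words if w not in toy_stopwords)

print(clean("The room was NOT clean, and we paid $250!"))
# prints: room not clean paid
```

Because "not" survives the filter, "not clean" stays distinguishable from "clean", which matters once bigram features are counted.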

### A look at some of the reviews after processing

In [10]:
review_dfy = review_df
%time review_dfy['body'] = review_dfy['body'].apply(preprocess)

for idx in range(5):
    print()
    print(review_dfy.body[idx])    

CPU times: user 3min 32s, sys: 938 ms, total: 3min 33s
Wall time: 3min 36s

moment walk adob knew mistak young women counter look like scum broken smile room small date money paid no secur park park gener car park bu depot bare slept bed hard concern new car park husband town newcastl veri sick friend weekend stay just weekend more stress just horribl not

check queue wait onli person recept room appoint no view bed comfort shower great pressur breakfast great hollandais sauc perfect

review submit behalf wife stay guest march room comfort pleasant stay unfortun entir sour left seagat tb portabl usb hard drive check use hard drive previou day usb cabl attach whilst pack set asid bed purpos bought carri case intend pack case put hand luggag arriv home carri case hand luggag immedi realis left portabl hard drive attach usb cabl bed room not arriv home late even so phone abod hotel thing morn advis search later not contact hour later phone inform no portabl hard drive not room veri upset 

In [11]:
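# review_dfz is another reference to review_df (not a copy), whose 'body' column was already
# preprocessed in place in In [10]. This cell therefore stems the text a second time
# (e.g. 'hollandais' becomes 'hollandai'). Use review_df.copy() and run preprocess once to avoid this.
review_dfz = review_df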
%time review_dfz['body'] = review_dfz['body'].apply(preprocess)

for idx in range(5):
    print()
    print(review_dfz.body[idx])    

CPU times: user 3min 4s, sys: 2.92 s, total: 3min 7s
Wall time: 3min 8s

moment walk adob knew mistak young women counter look like scum broken smile room small date money paid no secur park park gener car park bu depot bare slept bed hard concern new car park husband town newcastl veri sick friend weekend stay just weekend more stress just horribl not

check queue wait onli person recept room appoint no view bed comfort shower great pressur breakfast great hollandai sauc perfect

review submit behalf wife stay guest march room comfort pleasant stay unfortun entir sour left seagat tb portabl usb hard drive check use hard drive previou day usb cabl attach whilst pack set asid bed purpo bought carri case intend pack case hand luggag arriv home carri case hand luggag immedi reali left portabl hard drive attach usb cabl bed room not arriv home late so phone abod hotel thing morn advi search later not contact hour later phone inform no portabl hard drive not room veri upset no doubt left ma

In [12]:
#Split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(review_dfz.body, review_dfz.rati, test_size=0.3, random_state=123)

In [13]:
uniVect = CountVectorizer(analyzer = "word",  tokenizer = None, preprocessor = None, 
                         ngram_range = (1, 1), binary = False, strip_accents='unicode')

biVect = CountVectorizer(analyzer = "word",  tokenizer = None, preprocessor = None, ngram_range = (2, 2), strip_accents='unicode')

triVect = CountVectorizer(analyzer = "word",  tokenizer = None, preprocessor = None, ngram_range = (3, 3), strip_accents='unicode')


bi_triVect = CountVectorizer(analyzer = "word",  tokenizer = None, preprocessor = None, ngram_range = (2, 3), strip_accents='unicode')

rfVect = CountVectorizer(analyzer = "word",  tokenizer = None, preprocessor = None, ngram_range = (2, 2), 
                         strip_accents='unicode', max_features = 1000)
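
The vectorizers above differ mainly in `ngram_range`. As a rough, standard-library illustration of what the inclusive range selects, the sketch below counts word n-grams on a made-up sentence. The function `ngram_counts` is hypothetical and splits on whitespace only, whereas `CountVectorizer` tokenizes with a regular expression that by default drops single-character tokens, so the two will not match exactly on real text.

```python
from collections import Counter

def ngram_counts(text, ngram_range):
    """Count word n-grams for every n in the inclusive range (lo, hi)."""
    tokens = text.split()
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

review = "room not clean room not quiet"   # made-up, already preprocessed text
print(ngram_counts(review, (1, 1)))   # unigrams only, like uniVect
print(ngram_counts(review, (2, 2)))   # bigrams only, like biVect and rfVect
print(ngram_counts(review, (2, 3)))   # bigrams and trigrams, like bi_triVect
```

The vocabulary grows quickly with n, which is presumably why `rfVect` caps it with `max_features = 1000` before the random forest is trained.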

## Visualization through Confusion Matrix

In [14]:
def make_confusion_matrix_relative(confusion_matrix):
    star_category_classes = [1, 2, 3, 4, 5]
    N = list(map(lambda clazz : sum(Y_test == clazz), star_category_classes))
    # zeros rather than np.empty, so a class absent from Y_test gets a defined (all-zero) row
    relative_confusion_matrix = np.zeros((len(star_category_classes), len(star_category_classes)))
    
    for j in range(0, len(star_category_classes)):
        if N[j] > 0:
            relative_frequency = confusion_matrix[j, :] / float(N[j])
            relative_confusion_matrix[j, :] = relative_frequency
            
    return relative_confusion_matrix

def plot_confusion_matrix(confusion_matrix=[[]], title='CM', savefilename=''):
    rcm = make_confusion_matrix_relative(confusion_matrix)
    #plt.imshow(rcm, vmin=0, vmax=1, interpolation='nearest')
    c = plt.pcolor(rcm, edgecolors='k', linewidths=4, cmap='jet', vmin=0.0, vmax=1.0)
    plt.title(title)
    plt.colorbar()
    plt.ylabel('Actual Label')
    plt.xlabel('Predicted Label')
    plt.xticks(0.5 + np.arange(5), np.arange(1,6))
    plt.yticks(0.5 + np.arange(5), np.arange(1,6))

    def show_values(pc, fmt="%.2f", **kw):
        pc.update_scalarmappable()
        ax = pc.axes  # Artist.get_axes() is gone from later matplotlib releases; use the axes attribute
        for p, color, value in zip(pc.get_paths(), pc.get_facecolors(), pc.get_array()):
            x, y = p.vertices[:-2, :].mean(0)
            if sum(color[:2] > 0.3) >= 2:
                color = (0.0, 0.0, 0.0)
            else:
                color = (1.0, 1.0, 1.0)
            ax.text(x, y, fmt % value, ha="center", va="center", color=color, **kw)
    
    show_values(c)
    if savefilename:
        plt.savefig(savefilename, bbox_inches='tight')
    
    return plt.show()

def print_classifier_performance_metrics(name, predictions):
    target_names = ['1 star', '2 star', '3 star', '4 star', '5 star']
    
    print ("MODEL: %s" % name)
    print ()

    print ('Precision: ' + str(metrics.precision_score(Y_test, predictions, average='micro')))
    print ('Recall: ' + str(metrics.recall_score(Y_test, predictions, average='micro')))
    print ('F1: ' + str(metrics.f1_score(Y_test, predictions,  average='micro')))
    print ('Accuracy: ' + str(metrics.accuracy_score(Y_test, predictions)))

    print()
    print ('Classification Report:')
    print (classification_report(Y_test, predictions, target_names=target_names))
    
    print()
    # ddof=1 gives the sample variance of the per-class scores (divisor n - 1)
    print ('Precision variance: %f' % np.var(precision_score(Y_test, predictions, average=None), ddof=1))

    print()
    print ('Recall variance: %f' % np.var(recall_score(Y_test, predictions, average=None), ddof=1))
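
What `make_confusion_matrix_relative` computes can be checked by hand on a small matrix. The sketch below is a pure-Python illustration with hypothetical counts for three classes; `row_normalise` and `toy_counts` are names made up here. Each row (actual label) is divided by its total, so the diagonal holds the per-class recall, and a row with no samples is left at zero.

```python
def row_normalise(counts):
    """Divide each row by its total; rows with no samples stay all zeros."""
    relative = []
    for row in counts:
        total = sum(row)
        relative.append([c / total if total else 0.0 for c in row])
    return relative

# Hypothetical 3-class counts: rows are actual labels, columns are predictions.
toy_counts = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 0],   # a class that never occurs in the test set
]
for row in row_normalise(toy_counts):
    print(["%.2f" % v for v in row])
```

Normalising by row makes classes of very different sizes comparable in the plot, which is useful when most reviews carry 4 or 5 stars.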

### Bag of Words Model
#### Transforming the hotel reviews into feature vectors

In [15]:
uniTrain = uniVect.fit_transform(X_train)
uniTest = uniVect.transform(X_test)
uniTrain, uniTest

#Make predictions with Unigram Multinomial NB
uniNb_classifier = MultinomialNB()
uniNb_classifier.fit(uniTrain, Y_train)
uniNb_prediction = uniNb_classifier.predict(uniTest)

uniNbConfMatrix = confusion_matrix(Y_test, uniNb_prediction)
print (make_confusion_matrix_relative(uniNbConfMatrix))
plot_confusion_matrix(uniNbConfMatrix, 'Multinomial Naive Bayes Confusion Matrix', savefilename='MultinomialCM.png')

ValueError: Unknown label type: (array([2.0, 4, '5', ..., 5, 4, 5], dtype=object),)
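
The error above indicates that the `rati` column holds a mix of floats, integers and strings (`2.0`, `4`, `'5'`), which scikit-learn cannot treat as one label type. A minimal sketch of the repair, using hypothetical labels and a made-up helper `to_star_rating`, is to coerce every label to an integer star rating before splitting the data. With pandas, something along the lines of `pd.to_numeric(review_df['rati'], errors='coerce')`, followed by dropping the missing rows and casting to `int`, should have the same effect; that route is an assumption and was not run against this dataset.

```python
# Hypothetical labels mimicking the mix reported in the error message.
raw_labels = [2.0, 4, "5", 5, 4, 5]

def to_star_rating(value):
    """Coerce 2.0, 4 or '5' to an int star rating; return None if impossible."""
    try:
        rating = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None
    return rating if 1 <= rating <= 5 else None

clean_labels = [to_star_rating(v) for v in raw_labels]
print(clean_labels)
# prints: [2, 4, 5, 5, 4, 5]
```

Rows whose rating comes back as `None` would need to be dropped together with their review text, so that features and labels stay aligned.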

In [None]:
uniTrain
Y_train

In [None]:
print_classifier_performance_metrics('Multinomial Naive Bayes', uniNb_prediction)
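
A note on reading these metrics: with one label per review, every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the actual class). Micro-averaged precision, recall and F1 therefore all equal the accuracy, and the per-class rows of the classification report carry the extra information. The sketch below checks this on hypothetical ratings using only the standard library; `micro_scores` is a name made up for this example.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision and recall plus accuracy, single-label case."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1   # predicted this label, truth was different
            elif t == label:
                fn += 1   # truth was this label, predicted something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, accuracy

# Hypothetical star ratings, not taken from the review data.
y_true = [5, 4, 5, 3, 1, 4, 5, 2]
y_pred = [5, 5, 5, 4, 1, 4, 4, 2]
print(micro_scores(y_true, y_pred))
# prints: (0.625, 0.625, 0.625)
```

If a single summary number that is sensitive to the rarer star classes is wanted, `average='macro'` is the usual alternative to `average='micro'`.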

### Bigram Naive Bayes Model
#### Transform hotel reviews into feature vectors by counting bigram occurrences

In [None]:
biTrain = biVect.fit_transform(X_train)
biTest = biVect.transform(X_test)
biTrain, biTest

#Make predictions with Bigram Multinomial NB
bigram_multinomial_nb_classifier = MultinomialNB().fit(biTrain, Y_train)
bigram_multinomial_nb_prediction = bigram_multinomial_nb_classifier.predict(biTest)

#Visualize through confusion matrix
bigram_multinomial_confusion_matrix = confusion_matrix(Y_test, bigram_multinomial_nb_prediction)
print (make_confusion_matrix_relative(bigram_multinomial_confusion_matrix))
plot_confusion_matrix(bigram_multinomial_confusion_matrix, 'Bigram Multinomial Naive Bayes Confusion Matrix', savefilename='BigramMultinomialCM.png')

In [None]:
print_classifier_performance_metrics('Bigram Multinomial Naive Bayes', bigram_multinomial_nb_prediction)

### Trigram Naive Bayes Model

#### Transform hotel reviews into feature vectors by counting trigram occurrences


In [None]:
trigram_multinomial_feature_matrix_train = triVect.fit_transform(X_train)  # triVect is the trigram vectorizer from In [13]
trigram_multinomial_feature_matrix_test = triVect.transform(X_test)
#trigram_multinomial_feature_matrix_train, trigram_multinomial_feature_matrix_test

#Make predictions with Trigram Multinomial NB
tri_gram_multinomial_nb_classifier = MultinomialNB().fit(trigram_multinomial_feature_matrix_train, Y_train)
tri_gram_multinomial_nb_prediction = tri_gram_multinomial_nb_classifier.predict(trigram_multinomial_feature_matrix_test)

#Visualize through confusion matrix
trigram_multinomial_confusion_matrix = confusion_matrix(Y_test, tri_gram_multinomial_nb_prediction)
plot_confusion_matrix(trigram_multinomial_confusion_matrix, 'Trigram Multinomial Naive Bayes Confusion Matrix', savefilename='TrigramMultinomialCM.png')


In [None]:
print_classifier_performance_metrics('Trigram Multinomial Naive Bayes', tri_gram_multinomial_nb_prediction)

### Random Forest 100 Learners Model

In [None]:
forest100 = RandomForestClassifier(n_estimators = 100, random_state=42)

#Transform hotel reviews into feature vectors (rfVect: top 1000 bigrams, defined in In [13])
random_forest_feature_matrix_train = rfVect.fit_transform(X_train)
random_forest_feature_matrix_test = rfVect.transform(X_test)

#Make predictions with random forest set at 100 learners
%time forest100.fit(random_forest_feature_matrix_train.toarray(), Y_train)
forest100_pred = forest100.predict(random_forest_feature_matrix_test.toarray())
np.save('forest100pred', forest100_pred)

#Visualize results in confusion matrix
random_forest_confusion_matrix = confusion_matrix(Y_test, forest100_pred)
plot_confusion_matrix(random_forest_confusion_matrix, 'Random Forest (100 Learners) Confusion Matrix', savefilename='RandomForestCM.png')


In [None]:
print_classifier_performance_metrics('Random Forest (100 Learners)', forest100_pred)